# BodyGen: Advancing Towards Efficient Embodiment Co-Design

Published as a conference paper at ICLR 2025

Haofei Lu¹, Zhe Wu¹, Junliang Xing¹, Jianshu Li², Ruoyu Li², Zhe Li², Yuanchun Shi¹

¹Department of Computer Science and Technology, Tsinghua University, ²Ant Group

{luhf23,wu-z24}@mails.tsinghua.edu.cn, {jlxing,shiyc}@tsinghua.edu.cn, {jianshu.l,ruoyu.li,lizhe.lz}@antgroup.com

## ABSTRACT

Embodiment co-design aims to optimize a robot's morphology and control policy simultaneously. While prior work has demonstrated its potential for generating environment-adaptive robots, this field still faces persistent challenges in optimization efficiency due to (i) the combinatorial nature of morphological search spaces and (ii) the intricate dependencies between morphology and control. We prove that ineffective morphology representation and unbalanced reward signals between the design and control stages are key obstacles to efficiency. To advance towards efficient embodiment co-design, we propose BodyGen, which utilizes (1) topology-aware self-attention for both design and control, enabling efficient morphology representation with lightweight model sizes; and (2) a temporal credit assignment mechanism that ensures balanced reward signals for optimization. Building on these findings, BodyGen achieves an average 60.03% performance improvement over state-of-the-art baselines. We provide code and more results on the website: https://genesisorigin.github.io.

## 1 INTRODUCTION

Figure 1: Embodied agents generated by BodyGen. (a) Initial designs. (b) Morphologies with policies.

Species in nature have had millions of years to evolve remarkable capacities for adapting to their environment (Pfeifer & Scheier, 2001; Vargas et al., 2014).
Time has gifted them with well-adapted physical bodies for movement and navigation, powerful processors for centralized information processing, and effective actuators for rapid interaction with their surroundings. Inspired by this observation, embodiment co-design (Sims, 1994; Ha, 2019; Yuan et al., 2021; Wang et al., 2023), where a robot's morphology and control policy are optimized simultaneously, has gained increasing attention and demonstrates significant potential in various downstream fields, such as automated robot design and bio-inspired robot generation (Kriegman et al., 2020; Nakajima et al., 2018; Judd et al., 2019; Pan et al., 2021; Whitman et al., 2023).

However, this task is extremely difficult: (1) the morphology search space is vast and combinatorial, with each morphology corresponding to its own action and state spaces; (2) evaluating each candidate design requires an expensive roll-out to find its optimal control policy, which is computationally almost infeasible. Traditional evolutionary strategies (Sims, 1994; Wang et al., 2018b; Zhao et al., 2020; Gupta et al., 2021b) address these challenges through mutation-based population optimization but suffer from inefficient sampling and scalability limitations, requiring large amounts of computation. While structural constraints like symmetry (Gupta et al., 2021b; Dong et al., 2023) reduce the search complexity by a constant factor, such human priors may compromise functionality (Yuan et al., 2021). Alternative approaches employ modular GNN-based controllers (Huang et al., 2020; Yuan et al., 2021) for cross-morphology policy sharing, yet struggle with effective joint-level message aggregation (Kurin et al., 2020). In this work, we aim to further push the potential of embodiment co-design by introducing a method that enables efficient generation of high-performance embodied agents while remaining computationally affordable.

∗Corresponding Author

Figure 2: Overview of BodyGen, which leverages an RL-based framework for jointly evolving morphology and control policy, and an attention-based network (MoSAT) equipped with Topology Position Encoding (TopoPE) for centralized message processing.

We present BodyGen, a reinforcement learning framework for efficient, environment-adaptive embodied agent generation. Inspired by recent co-optimization approaches (Yuan et al., 2021), BodyGen directly uses auto-regressive transformers to generate an agent's morphology before executing environment interactions. Given an initial design (a.k.a. a prompt), BodyGen can output optimal morphologies and corresponding controllers at the same time. In contrast to previous methods, BodyGen utilizes a joint-level self-attention mechanism to achieve direct message communication using transformers. We further propose a topology-aware positional encoding for effective, lightweight morphology representation. Additionally, we address the inherent reward imbalance between the morphology design (zero-reward-guided) and control (rich-reward-guided) phases: our enhanced temporal credit assignment mechanism dynamically balances reward signals across both stages, enabling coordinated optimization. The whole BodyGen model has fewer than 2M parameters, and a high-performance embodied agent can be generated using a single Nvidia GPU within 30 hours.

To summarize, our contributions are as follows:

- We propose BodyGen, an end-to-end reinforcement learning framework for efficient embodiment co-design.
- We design a Morphology Self-Attention architecture (MoSAT) to provide joint-to-joint message transmission, featuring our proposed Topological Position Encoding (TopoPE) for efficient morphology representation.
- We propose a temporal credit assignment mechanism that ensures balanced reward signals in the morphology design and control phases, thus facilitating co-design learning.
- Comprehensive experiments across various tasks demonstrate BodyGen's advantages over previous methods in terms of both convergence speed and performance. BodyGen achieves an average performance improvement of 60.03% over the state-of-the-art baselines.

## 2 RELATED WORK

Universal Morphology Control. Embodiment co-design requires controlling robots with changeable morphologies and adapting to their incompatible action and state spaces. Universal Morphology Control (UMC), which employs a shared network to control each actuator separately, presents a promising solution to this problem. To better perceive the topological structures of various morphologies, some methods (Pathak et al., 2019; Wang et al., 2018a; Huang et al., 2020) employed Graph Neural Networks (GNNs) to enable communication between neighboring actuators. Recent works also use Transformers (Vaswani et al., 2017) to overcome the limitations of multi-hop information aggregation brought by GNNs (Kurin et al., 2020; Hong et al., 2021; Gupta et al., 2021a; Dong et al., 2022). Despite these advancements, several challenges remain. Many existing methods are limited to parametric variations of a small number (e.g., 2-3) of predefined morphologies, whereas a comprehensive morphology-agnostic design space remains largely unexplored. Furthermore, most previous works either do not fully leverage morphology information or consider only a simple form of it, and even the usefulness of such information remains controversial (Kurin et al., 2020; Hong et al., 2021; Gupta et al., 2021a; Xiong et al., 2023). In this work, we prove that morphology information plays a crucial role, and the correctness of morphology representation significantly influences performance.
To this end, we introduce a simple yet effective positional encoding technique, TopoPE, which facilitates message localization within the body and enhances knowledge sharing among similar morphologies. By representing the 2D topological structure with position embeddings, we explore the potential of autoregressive transformers for robotic design generation.

Embodiment Co-design. For embodied artificial agents, the control policy (Lillicrap et al., 2015; Schulman et al., 2015a; Haarnoja et al., 2018; Schulman et al., 2017; Lowrey et al., 2018) has been well studied in the reinforcement learning and robotics communities, while another critical component, the physical form of the embodiment, is attracting more and more attention (Kriegman et al., 2020; Bhatia et al., 2021; Xu et al., 2021; Huang et al., 2024). Embodiment co-design aims to optimize a robot's morphology and control simultaneously and is considered a promising way to stimulate the embodied intelligence embedded in morphology. Previous methods (Sims, 1994; Wang et al., 2018b; Gupta et al., 2021b) typically utilize evolutionary search (ES) to learn directly within the vast design space, which unavoidably brings inefficient sampling and expensive computation. A line of works (Wang et al., 2018b; Gupta et al., 2021a) introduces more human morphology priors, such as symmetry, to reduce the search space. Yuan et al. (2021) propose jointly optimizing a robot's morphology and control policy via reinforcement learning. This paper focuses on the RL-based approach to joint optimization of both morphology and control. We aim to establish a comprehensive framework for embodiment co-design, systematically addressing key obstacles to efficiency during training.

## 3 PRELIMINARIES

Morphology Representation.
The morphology of an agent can be formally defined as an undirected graph $G = (V, E, A^v, A^e)$, where each node $v \in V$ represents a limb of the robot, and each edge $e = (v_i, v_j) \in E$ represents a joint connecting two limbs. $A^v$ and $A^e$ are two mapping functions: $A^v: V \to \Lambda^v$ maps a limb node $v$ to its physical attributes, and $A^e: E \to \Lambda^e$ maps an edge $e = (v_i, v_j)$ to its joint attributes. Here $\Lambda^v = \{\Lambda^v_i\}$ is the limb attribute space, consisting of attributes $\Lambda^v_i$ such as limb lengths, sizes, and materials, and $\Lambda^e = \{\Lambda^e_i\}$ is the joint attribute space, consisting of attributes $\Lambda^e_i$ such as rotation ranges and maximum motor torques. Consequently, the design space $\mathcal{D}$ comprises all valid robot morphologies $G \in \mathcal{D}$.

Co-Design Optimization. The fitness $F$ of an agent represents its performance in a specific environment and is typically evaluated by rewards. In traditional control problems with a fixed morphology $G_0$, we aim to optimize its control policy $\pi$ towards the optimal $\pi^* = \arg\max_{\pi} F(\pi, G_0)$ for maximum fitness. In co-design problems, we optimize not only the control policy but also the morphology design. This co-design process is formulated as a bi-level optimization problem:

$$G^* = \arg\max_{G} F(\pi^*_G, G) \quad \text{s.t.} \quad \pi^*_G = \arg\max_{\pi_G} F(\pi_G, G), \tag{3.1}$$

where the inner loop defines the optimal control policy of a given morphology, and the outer loop defines the optimal morphology using its optimal policy. Previous works typically use evolutionary algorithms (Sims, 1994; Wang et al., 2018b; Gupta et al., 2021b) to solve this problem. In this work, BodyGen leverages an RL-based framework and jointly optimizes both loops:

$$\pi^*(\cdot \mid G^*),\ G^* = \arg\max_{\pi(\cdot \mid G),\, G} F(\pi(\cdot \mid G), G), \tag{3.2}$$

using the universal control policy $\pi(\cdot \mid G)$ to facilitate knowledge sharing among agents with similar morphologies.

Figure 3: The Morphology Self-Attention (MoSAT) architecture. (a) MoSAT architecture: the sensor observations from different limbs are projected to hidden tokens, processed centrally by several MoSAT blocks, and decoded into separate actions. (b) Batch processing for MoSAT: the MoSAT network processes different morphologies in a batch manner and learns a universal control policy $\pi(\cdot \mid G)$, thus improving training efficiency.

Reinforcement Learning. We define the problem formulation of Morphology-Conditioned Reinforcement Learning for embodiment co-design. We consider the augmented Markov Decision Process (MDP), which can be described by a 6-element tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \mathcal{D}, \Phi)$. $\Phi$ is a flag that distinguishes the design and control stages. $\mathcal{S}$ denotes the state space. $\mathcal{A}(\Phi)$ represents the action space, where $a \in \mathcal{A}(\Phi = \text{Design})$ changes the morphology of the agent, and $\mathcal{A}(\Phi = \text{Control})$ defines the action space for motion control. $\mathcal{T}: \mathcal{S} \times \mathcal{A}(\Phi) \times \mathcal{S} \to [0, 1]$ represents the environmental transition probability from one state $s_t$ to another $s_{t+1}$, given an action $a_t$. $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the state-action reward, and the fitness function $F$ is defined as the episodic return $\sum_{t=1}^{T} r_t(s_t, a_t)$ based on rewards. As defined above, $\mathcal{D}$ represents the morphology design space, and our goal is to find a co-design policy $\pi: \mathcal{S} \times \mathcal{D} \to \mathcal{A}$ that maximizes the environmental fitness $F$.

The co-design process consists of two sequential stages. (1) In the Design Stage, an agent begins with an initial morphology $G_0$ and iteratively refines it through a series of morphology-transforming actions via a design policy $\pi^D$, until it reaches the final design $G_{done}$. (2) In the subsequent Control Stage, the agent interacts with the environment using its corresponding control policy $\pi^C$.

BodyGen addresses three key challenges that hinder co-design efficiency: (1) Message Transmission Decay, which occurs when multi-hop communication fails to effectively propagate information to distant limbs (Kurin et al., 2020). BodyGen leverages self-attention, using transformers, for both autoregressive body building and centralized body control.
(2) Ineffective Morphology Representation (Yuan et al., 2021; Hong et al., 2021). BodyGen employs a simple yet effective topology position encoding mechanism to better align similar morphologies for knowledge sharing between them. (3) Unbalanced Reward Signals. BodyGen utilizes a temporal credit assignment mechanism to ensure balanced reward signals between different co-design stages.

## 4.1 ATTENTION-BASED CO-DESIGN NETWORK

BodyGen divides the Design Stage into two sub-stages, the Topology Design Stage and the Attribute Design Stage, which transform the topology $(V_0, E_0)$ and the corresponding attributes $(A^v_0, A^e_0)$ of the agent's morphology, respectively. Consequently, the design policy $\pi^D$ is also divided into two sub-policies $\pi^D = (\pi^{topo}, \pi^{attr})$ for the corresponding actions.

During the Topology Design Stage, the agent can modify the topology through three basic actions: (1) Addition: add a new child limb $v_{new}$ to $v$, along with a new joint $e_{new} = (v, v_{new})$ connecting them. (2) Deletion: delete the limb $v$ and the joint to its parent $e = (v_p, v)$ if $v$ is a leaf node. (3) No Change: make no changes to node $v$. The agent's policy $\pi^{topo}$ is conditioned on the current topology $(V, E)$ at timestep $t$ and is defined as the product of the action distributions $\pi^{topo}_v$ of all limbs$^1$:

$$a^{topo} \sim \pi^{topo}(a^{topo} \mid G) = \prod_{v \in V} \pi^{topo}_v(a^{topo}_v \mid p_v, G), \tag{4.1}$$

where $p_v$ represents the topology position of node $v$. In the Attribute Design Stage, the agent further generates limb and joint attributes based on the given topology $G_{done} = (V_{done}, E_{done})$. The agent's attribute policy $\pi^{attr}$ can be formulated as:

$$a^{attr} \sim \pi^{attr}(a^{attr} \mid G) = \prod_{v \in V} \pi^{attr}_v(a^{attr}_v \mid p_v, G). \tag{4.2}$$

Finally, in the Control Stage, the agent uses the morphology generated in the Design Stage to interact with the environment using the control policy $\pi^C$:

$$a^{ctrl} \sim \pi^{ctrl}(a^{ctrl} \mid s, G_{done}) = \prod_{v \in V} \pi^{ctrl}_v(a^{ctrl}_v \mid s, p_v, G_{done}), \tag{4.3}$$

where $s = \{s_v\}$ denotes the sensor states of every limb, including forces, positions, velocities, etc. We use $a^{ctrl}_v$ to represent the torque of the joint connecting node $v$ with its parent $v_p$.

During the co-design process, we aim for the policy network to accommodate evolving morphologies in a way that offers two key advantages: (1) a single agent can maintain unified control even as the robot's body grows, preserving consistency across different designs, and (2) direct point-to-point communication between joints allows for richer information exchange, enabling more coordinated actions throughout the entire system. Inspired by the centralized signal processing of mammals in nature, we propose the Morphology Self-Attention architecture (MoSAT) for efficient, centralized message processing. Figure 3 (a) provides an overview of MoSAT.

Latent Projection. To enhance network processing capabilities, we encode the information from each limb's sensor and map it into a latent space as message tokens. Specifically, limb sensor states $s_v$ are first processed through a parameter-shared linear mapping layer $\phi_h(\cdot)$:

$$m = \phi_h(s) + E_{pos}(V, E), \quad s \in \mathbb{R}^{L \times d},\ m \in \mathbb{R}^{L \times D}, \tag{4.4}$$

where $L$ is the number of limbs, $d$ is the input state dimension, and $D$ is the hidden dimension. We employ our proposed TopoPE for morphology representation, which is discussed further in Section 4.2. The position encodings $e_v$ are added to the message tokens $m_v$ to obtain position-embedded message tokens.

Centralized Processing. As illustrated in Figure 2, we aim for efficient message interaction. BodyGen utilizes the scaled dot-product self-attention $\mathrm{Attention}(\cdot)$ for point-to-point, centralized processing. Specifically, each message token uses its query $q_{v_i}$ to match the key $k_{v_j}$ of another message, which weights that message's value $v_{v_j}$:

$$\mathrm{Attention}(m) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad \text{where } Q = mW_Q,\ K = mW_K,\ V = mW_V, \tag{4.5}$$

where $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$ are learnable matrices.
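
To make Equation 4.5 concrete, the following is a minimal single-head sketch in plain Python. It is an illustration only, not the paper's implementation: the helper names (`matmul`, `softmax`, `self_attention`), the identity projection matrices, and the toy token values are all assumptions chosen so the result can be checked by hand.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(m, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over limb tokens m (L x D)."""
    q, k, v = matmul(m, w_q), matmul(m, w_k), matmul(m, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # weights[i][j]: how strongly limb i attends to limb j
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy input (hypothetical values): 3 limbs, hidden size D = 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = self_attention(tokens, identity, identity, identity)
```

Every limb token attends to every other limb in a single step, which is the point-to-point property the text contrasts with multi-hop message passing. In this toy input, limb 0 attends equally to itself and limb 2, and less to limb 1, whose key is orthogonal to its query.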
For the MoSAT block design, we adopt Pre-LN (Xiong et al., 2020) for layer normalization and add residual connections (He et al., 2016; Dosovitskiy et al., 2020).

Forwarding. Finally, we need to output actions for each actuator. We decode the attended messages using a linear projector $\phi_a(\cdot)$ to generate the action logits for each actuator:

$$\pi(a \mid s) = \begin{cases} \mathrm{SoftMax}(\phi_a(m_N)), & \text{Discrete Action Space} \\ \mathcal{N}(a;\ \phi_a(m_N), \Sigma), & \text{Continuous Action Space,} \end{cases} \tag{4.6}$$

where $N$ is the number of stacked attention blocks. The above process equips MoSAT with the capability to handle various morphologies. As shown in Figure 3 (b), to maximize training efficiency, we further give MoSAT the ability to process multiple morphologies in a batch mode. We provide more implementation details in Appendix A.3.3.

$^1$ We use the limb-level action distribution, where each limb corresponds to its own action distribution, and the entire agent's action distribution is composed of all limbs' distributions. This effectively resolves the incompatibility of state and action spaces across the changeable topological morphologies.

Figure 4: The motivation of our proposed topology-aware position encoding TopoPE. (a) Morphology changes: during the co-design procedure, the agent's morphology keeps changing. (b) Traversal PE: a typical traversal-based PE in previous works results in inconsistency across morphologies. (c) TopoPE (ours): TopoPE adapts better to similar morphology structures by keeping their encodings alignable.

## 4.2 TOPOLOGY-AWARE POSITION ENCODING FOR MORPHOLOGY REPRESENTATION

The vanilla attention operation treats each token equally, neglecting morphology information.
However, it is crucial to inject positional information during embodiment co-design, for two reasons: (1) similar information from different morphology positions has different meanings, so localizing the source of a message is important; (2) similar morphology structures may share similar local control policies, and positional information facilitates knowledge alignment and sharing across different agents. To better capture the differences between morphological structures and share structural knowledge among similar morphologies, we propose Topology Position Encoding (TopoPE), a topology-aware position encoding mechanism that handles the above two issues efficiently.

As demonstrated in Figure 4, for traversal-based limb indexing methods (Hong et al., 2021; Gupta et al., 2021a; Xiong et al., 2023), slight morphological changes can cause global indexing offsets. To mitigate the effect of such offsets, TopoPE uses a hash map $H(\cdot)$ for position encoding, which maps the path between the current limb $v_i$ and the root limb $v_{root}$ to a unique embedding $e_{v_i}$:

$$e_{v_i} = H([v_i \mapsto v_{root}]), \quad \text{where } [v_i \mapsto v_{root}] = [(v_i, p(v_i)),\ (p(v_i), p^2(v_i)),\ \ldots,\ (p^{l-1}(v_i), v_{root})], \tag{4.7}$$

where $p^n(v)$ is the $n$-th ancestor of $v$. Practically, if $v$ is the $k$-th child of its parent $p(v)$, the edge $(v, p(v))$ is denoted by the integer $k$, allowing the path index to be represented as a sequence of integers.

During the Topology Design Stage, BodyGen generates the robot's topology autoregressively (Figure 2). The topology created at each step is passed to MoSAT in the following step, where newly added limbs are automatically registered and assigned their Topology Position Embedding. Meanwhile, during the Attribute Design Stage and the Control Stage, this final topology remains fixed. Experiments demonstrate that TopoPE effectively adapts to growing morphologies, facilitating knowledge alignment and sharing across agents, which leads to better performance.
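
The path-based indexing of Equation 4.7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the morphology is represented as a dictionary from each limb to its ordered children, the limb names (`torso`, `leg_a`, and so on) are hypothetical, and `TopoTable` stands in for the hash map $H(\cdot)$ by returning an embedding slot index rather than a learned vector.

```python
def topo_keys(children, root):
    """Map each limb to the tuple of child indices along its path up to the root."""
    keys = {root: ()}
    stack = [root]
    while stack:
        parent = stack.pop()
        for k, child in enumerate(children.get(parent, []), start=1):
            # Edge (child, parent) comes first, followed by the parent's own path.
            keys[child] = (k,) + keys[parent]
            stack.append(child)
    return keys

def traversal_index(children, root):
    """Depth-first visiting order, the kind of index a traversal-based PE relies on."""
    order, stack = {}, [root]
    while stack:
        node = stack.pop()
        order[node] = len(order)
        stack.extend(reversed(children.get(node, [])))
    return order

class TopoTable:
    """Hash map from path keys to embedding slots; unseen paths are registered on first use."""
    def __init__(self):
        self.slots = {}

    def slot(self, key):
        return self.slots.setdefault(key, len(self.slots))

# Hypothetical morphology before and after one limb is added under leg_a.
before = {"torso": ["leg_a", "leg_b"], "leg_a": ["foot_a"]}
after = {"torso": ["leg_a", "leg_b"], "leg_a": ["foot_a", "spur_a"]}

keys_before = topo_keys(before, "torso")
keys_after = topo_keys(after, "torso")
dfs_before = traversal_index(before, "torso")
dfs_after = traversal_index(after, "torso")
```

In this toy case the depth-first index of `leg_b` shifts from 3 to 4 once `spur_a` is added, whereas its path key stays `(2,)`, so its embedding slot is unchanged. That is the alignment property the paragraph above describes.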
## 4.3 CO-DESIGN OPTIMIZATION WITH TEMPORAL CREDIT ASSIGNMENT

Figure 5: BodyGen leverages an actor-critic paradigm for policy optimization. (a) Policy network. (b) Value network.

To achieve efficient reward-driven co-design, BodyGen leverages an actor-critic paradigm based on reinforcement learning, which trains a value function $V_\theta(s_t)$ and a policy function $\pi_\theta(a_t \mid s_t)$ and updates them using collected trajectories. We employ Proximal Policy Optimization (PPO) (Schulman et al., 2017) to optimize the policy $\pi_\theta$ in the actor-critic framework. PPO uses the advantage function $\hat{A}_t(a_t, s_t)$ to quantify how much better an action $a_t$ is for the current state $s_t$, and optimizes the following surrogate objective function:

$$L^{policy} = \min\left\{ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t,\ \mathrm{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\ 1 - \epsilon,\ 1 + \epsilon\right) \hat{A}_t \right\}. \tag{4.8}$$

In the co-design process, vanilla PPO exhibits limited performance. Only the Control Stage directly receives environmental rewards. Moreover, in theory, a body-modifying action in the Design Stage influences all future timesteps, whereas a motion-control action in the Control Stage has a diminishing impact over time. To address this, we decouple the MDPs for body and policy optimization, linking them through a modified Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) for improved temporal credit assignment:

$$\hat{A}_t = \begin{cases} \delta_t + \gamma\lambda \hat{A}_{t+1}\,(1 - T_t \lor C_t), & \text{for Control Stage} \\ U_t - V_\theta(s_t), & \text{for Design Stage} \end{cases} \tag{4.9}$$

$$\text{where } \delta_t = r_t + \gamma V_\theta(s_{t+1})\,(1 - T_t) - V_\theta(s_t), \qquad U_t = r_t + U_{t+1}\,(1 - T_t \lor C_t),$$

where $\gamma$ is the discount factor, $\lambda$ is the exponential weighting factor of GAE, and $T_t$, $C_t$ are two environment flags denoting environment termination and truncation, respectively. This decoupling enables us to apply distinct optimization algorithms to each stage, potentially improving overall performance.
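
The stage-dependent advantage of Equation 4.9 can be written as one backward pass over a trajectory. The sketch below is a reading of the equation, not the released implementation. The function name `stage_advantages`, its argument layout, and the default values of `gamma` and `lam` are assumptions. The factor $(1 - T_t \lor C_t)$ is interpreted here as "cut the recursion when the episode terminates or is truncated", which is also an assumption.

```python
def stage_advantages(rewards, values, next_values, terminated, truncated,
                     is_design, gamma=0.99, lam=0.95):
    """Backward pass over one trajectory.

    Control steps use the GAE recursion; design steps use the undiscounted
    return-to-go U_t minus the value estimate V(s_t).
    """
    n = len(rewards)
    adv = [0.0] * n
    next_gae, next_u = 0.0, 0.0
    for t in reversed(range(n)):
        # Assumed reading of (1 - T_t v C_t): zero if the episode ended at step t.
        cont = 0.0 if (terminated[t] or truncated[t]) else 1.0
        # Undiscounted return-to-go, used by the Design Stage.
        u = rewards[t] + next_u * cont
        # TD error bootstraps through truncation but not through termination.
        delta = rewards[t] + gamma * next_values[t] * (1.0 - float(terminated[t])) - values[t]
        gae = delta + gamma * lam * next_gae * cont
        adv[t] = (u - values[t]) if is_design[t] else gae
        next_gae, next_u = gae, u
    return adv

# Toy trajectory (hypothetical numbers): 2 design steps with zero reward,
# then 3 control steps with reward 1 each; the episode terminates at the end.
rewards = [0.0, 0.0, 1.0, 1.0, 1.0]
zeros = [0.0] * 5
terminated = [False, False, False, False, True]
truncated = [False] * 5
is_design = [True, True, False, False, False]
adv = stage_advantages(rewards, zeros, zeros, terminated, truncated, is_design)
```

With zero value estimates, both design steps receive the full undiscounted return of 3, while the first control step receives a smaller, discounted advantage. This mirrors the motivation in the text: design actions are credited with everything that follows, whereas control actions have a decaying influence.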
The value loss function $L^{value}$ is defined as:

$$L^{value} = \left\| V_\theta(s_t) - \hat{R}_t \right\|^2, \quad \text{where } \hat{R}_t = \mathrm{sg}\!\left[ V_\theta(s_t) + \hat{A}_t \right], \tag{4.10}$$

where $\mathrm{sg}[\cdot]$ stands for the stop-gradient operator. During the transition from the Design Stage to the Control Stage, we shift from a GPT-style (Radford et al., 2019) approach to a BERT-style (Devlin et al., 2018) framework. Specifically, the token output of each limb is used to generate the action policy for its corresponding actuator (Equation 4.6), as illustrated in Figure 5(a). Meanwhile, the token output of the root limb is used for value prediction of the entire body at timestep $t$ (Figure 5(b)). To prevent conflicts in gradient descent (Yu et al., 2020; Liu et al., 2021) arising from different credit assignment strategies, each stage in the co-design process is equipped with a separate value network.

## 5 EXPERIMENTAL EVALUATIONS

Our experiments aim to validate our primary hypothesis: that efficient message and reward delivery can effectively overcome bottlenecks in the co-design process, leading to embodied agents that can better adapt to the environment. Additional visualization results are presented in Appendix A.8. Visit our project website for more visualization results: https://genesisorigin.github.io.

Environments. We conduct a comprehensive evaluation of BodyGen against baselines in ten challenging co-design environments (CRAWLER, TERRAINCROSSER, CHEETAH, SWIMMER, GLIDER-REGULAR, GLIDER-MEDIUM, GLIDER-HARD, WALKER-REGULAR, WALKER-MEDIUM and WALKER-HARD) on MuJoCo (Todorov et al., 2012). These environments encompass diverse physical world types (2D, 3D), environment tasks, search space complexities, ground terrains, and initial designs to provide a multilevel evaluation. See Appendix A.1 for detailed descriptions.
## 5.1 COMPARISON WITH BASELINES

We compare BodyGen with the following baselines: 1) Evolution-Based Algorithms: NGE (Wang et al., 2018b) maintains a population of agents with different morphologies for random mutation and preserves only the children of top-performing agents for further optimization. 2) RL-Based Algorithms: Transform2Act (Yuan et al., 2021) optimizes a robot's morphology and control concurrently through reinforcement learning. It utilizes graph neural networks (GNNs) and joint-specific MLPs (JSMLP) to foster knowledge sharing and specialization. 3) Universal Control Algorithms: UMC-Message (Wang et al., 2018a; Huang et al., 2020) leverages a localized message transmission mechanism for information exchange within the body, which is a typical method for universal morphology control. To make it suitable for embodiment co-design, we equip it with a policy network and our enhanced temporal credit assignment using reinforcement learning and denote it as UMC-Message*. The implementation details and full hyperparameters of BodyGen and all baselines are provided in Appendix A.3 and Appendix A.4.

Figure 6: Performance of BodyGen, NGE, Transform2Act, UMC-Message*, BodyGen w/o MoSAT, and BodyGen w/o Enhanced-TCA on ten co-design environments, with error regions indicating the standard error over four random seeds.

As shown in Figure 6, BodyGen achieves the highest task performance in all ten environments, with faster convergence than the baselines. Unlike the Universal Morphology Control (UMC) task, which focuses on a limited set of specific morphologies (Wang et al., 2018a; Huang et al., 2020), embodiment co-design deals with various changeable, morphology-agnostic robots.
Consequently, UMC-Message fails to converge within a limited time for complex tasks such as GLIDER-HARD and WALKER-HARD, because it lacks an adequate knowledge-alignment mechanism for complicated, changeable morphologies (such mechanisms include JSMLP in Transform2Act and TopoPE in BodyGen). Compared to evolutionary algorithms like NGE, we also find that RL-based methods demonstrate a significant performance advantage owing to much better sampling efficiency within the same number of environmental interactions, consistent with Yuan et al. (2021). By overcoming the bottlenecks in co-design, our approach goes even further: it achieves an average 60.03% performance improvement over the strongest baseline across all ten tasks.

## 5.2 ABLATION STUDIES

As mentioned in Section 4, our approach addresses inefficiencies in message and reward delivery at the intra-agent, inter-agent, and agent-environment levels. To better support our hypothesis and understand the importance of the corresponding key components (MoSAT, TopoPE, Enhanced-TCA), we designed four variants of our approach: (i) Ours w/o MoSAT, which removes the MoSAT structure and thus our attention-based centralized information processing across different limbs; (ii) Ours w/o Enhanced-TCA, which removes our temporal credit assignment mechanism and employs the original PPO for optimization; (iii) Ours w/o TopoPE, which removes TopoPE from our method. For a more comprehensive comparison, we also introduce another position encoding method from recent UMC methods: (iv) Ours w/ Traversal PE, where TopoPE is replaced with a traversal-based position embedding (Hong et al., 2021; Gupta et al., 2021a). Figure 6 presents the ablation studies for MoSAT and Enhanced-TCA, while Table 1 highlights the differences between positional embedding choices. Additional detailed experimental results are available in the Appendix (Table 11, Table 12).

Table 1: Comparison of different position encoding choices for morphology representation. The reported values are Mean ± Standard Error over four random seeds.

| Methods | CRAWLER | TERRAINCROSSER | CHEETAH | SWIMMER | GLIDER-REGULAR |
|---|---|---|---|---|---|
| TopoPE (ours) | 10381.96 ± 353.97 | 5056.01 ± 703.57 | 11611.52 ± 522.86 | 1305.17 ± 15.25 | 11082.29 ± 99.21 |
| w/ Traversal PE | 8582.24 ± 987.44 | 4339.60 ± 260.60 | 10581.62 ± 846.69 | 1292.05 ± 16.71 | 9801.31 ± 748.13 |
| w/o TopoPE | 7490.83 ± 267.70 | 1122.29 ± 659.38 | 7451.37 ± 2275.37 | 1371.20 ± 30.74 | 10137.83 ± 713.60 |

| Methods | GLIDER-MEDIUM | GLIDER-HARD | WALKER-REGULAR | WALKER-MEDIUM | WALKER-HARD |
|---|---|---|---|---|---|
| TopoPE (ours) | 11996.82 ± 595.51 | 10798.06 ± 298.39 | 12062.49 ± 513.07 | 12962.08 ± 537.34 | 11982.07 ± 520.78 |
| w/ Traversal PE | 10758.70 ± 401.90 | 9106.77 ± 679.59 | 10389.40 ± 1080.94 | 10972.13 ± 584.04 | 11255.89 ± 121.04 |
| w/o TopoPE | 4099.99 ± 2057.92 | 109.48 ± 10.03 | 10149.67 ± 255.99 | 6730.01 ± 705.06 | 6529.87 ± 1863.59 |

(1) Intra-agent level: The MoSAT module provides centralized information processing. Removing this module results in significant performance degradation. Transform2Act adds an MLP to each limb, enhancing local message processing and model performance, but this increases the model size to 19.64M parameters, a size that grows linearly with the complexity of the morphology. In contrast, BodyGen is more lightweight, with each model having only 1.43M parameters. We provide the model parameters of BodyGen and the baselines in Table 2. (2) Inter-agent level: TopoPE facilitates morphological knowledge sharing among agents, aiding in adjusting knowledge for similar morphologies and reducing redundant learning costs. Compared to "Traversal PE" and "w/o TopoPE", TopoPE enhances agent performance and stabilizes learning.
(3) Agent-environment level: Our proposed temporal credit assignment ensures that an agent receives balanced reward signals during both the morphology design and control phases, markedly improving final performance across all the co-design environments.

Table 2: Model parameters of BodyGen and baselines. Note: For NGE, the total number of models required is calculated as 20 + 20 × 0.15 × 125 = 395 (`population_size` + `population_size` × `elimination_rate` × `generations`). The total parameters are derived from `population_size` only.

| Models | Agent Parameters | Population Size | Total Parameters |
|---|---|---|---|
| BodyGen (Ours) | 1.43 M | 1 | 1.43 M |
| Transform2Act | 19.64 M | 1 | 19.64 M |
| UMC-Message* | 0.27 M | 1 | 0.27 M |
| NGE | 0.27 M | 20 | 5.4 M |

## 6 CONCLUSIONS AND LIMITATIONS

This work proposes BodyGen, an end-to-end reinforcement learning framework for efficient embodiment co-design. Our approach delivers messages and rewards efficiently through zero-decay message processing, effective morphological knowledge sharing, and balanced temporal credit assignment. Experiments demonstrate that BodyGen surpasses previous methods in both convergence speed and final performance while being efficient, lightweight, and scalable.

Limitations and Future Work. We acknowledge at least two limitations. Firstly, our approach remains focused on simulation environments, and further efforts are needed to transfer learned strategies to real physical systems. Secondly, our reward-driven reinforcement learning method focuses on improving control performance, yet it cannot simulate the rich perception and execution capabilities of real biological intelligent systems. In future research, we expect embodied intelligence to evolve perception and execution components following principles akin to biological evolution, enabling embodied agents to accomplish tasks more efficiently.

## ACKNOWLEDGEMENT

This work was supported in part by the Natural Science Foundation of China under Grant No.
62222606 and the Ant Group Security and Risk Management Fund.

REFERENCES

Jagdeep Bhatia, Holly Jackson, Yunsheng Tian, Jie Xu, and Wojciech Matusik. Evolution gym: A large-scale benchmark for evolving soft robots. Advances in Neural Information Processing Systems, 34:2201–2214, 2021.

Brain for AI Fandom. Central nervous system, 2024. URL https://brain-for-ai.fandom.com/wiki/Central_nervous_system. Accessed: 2024-05-21.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Heng Dong, Tonghan Wang, Jiayuan Liu, and Chongjie Zhang. Low-rank modular reinforcement learning via muscle synergy. Advances in Neural Information Processing Systems, 35:19861–19873, 2022.

Heng Dong, Junyu Zhang, Tonghan Wang, and Chongjie Zhang. Symmetry-aware robot design with structured subgroups. In International Conference on Machine Learning, pp. 8334–8355, 2023.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Encyclopaedia Britannica. Diffuse nervous system, 2024. URL https://www.britannica.com/science/nervous-system/Diffuse-nervous-systems. Accessed: 2024-05-21.

Agrim Gupta, Linxi Fan, Surya Ganguli, and Li Fei-Fei. Metamorph: Learning universal controllers with transformers. In International Conference on Learning Representations, 2021a.

Agrim Gupta, Silvio Savarese, Surya Ganguli, and Li Fei-Fei. Embodied intelligence via learning and evolution. Nature communications, 12(1):5721, 2021b.

David Ha. Reinforcement learning for improving agent design.
Artificial Life, 25(4):352–365, 2019.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Sunghoon Hong, Deunsol Yoon, and Kee-Eung Kim. Structure-aware transformer policy for inhomogeneous multi-task reinforcement learning. In International Conference on Learning Representations, 2021.

Suning Huang, Boyuan Chen, Huazhe Xu, and Vincent Sitzmann. Dittogym: Learning to control soft shape-shifting robots. arXiv preprint arXiv:2401.13231, 2024.

Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pp. 4455–4464, 2020.

Euan Judd, Gabor Soter, Jonathan Rossiter, and Helmut Hauser. Sensing through the body - noncontact object localisation using morphological computation. In Proceedings of the IEEE International Conference on Soft Robotics, pp. 558–563, 2019.

Sam Kriegman, Douglas Blackiston, Michael Levin, and Josh Bongard. A scalable pipeline for designing reconfigurable organisms. Proceedings of the National Academy of Sciences, 117(4):1853–1859, 2020.

Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. In International Conference on Learning Representations, 2020.

Boyu Li, Haoran Li, Yuanheng Zhu, and Dongbin Zhao. Mat: Morphological adaptive transformer for universal morphology policy learning. IEEE Transactions on Cognitive and Developmental Systems, 16(4):1611–1621, 2024. doi: 10.1109/TCDS.2024.3383158.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. volume 34, pp. 18878–18890, 2021.

Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848, 2018.

Kohei Nakajima, Helmut Hauser, Tao Li, and Rolf Pfeifer. Exploiting the dynamics of soft materials for machine learning. Soft robotics, 5(3):339–347, 2018.

Xinlei Pan, Animesh Garg, Animashree Anandkumar, and Yuke Zhu. Emergent hand morphology and control from optimizing robust grasps of diverse objects. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 7540–7547, 2021.

Deepak Pathak, Christopher Lu, Trevor Darrell, Phillip Isola, and Alexei A Efros. Learning to control self-assembling morphologies: a study of generalization via modularity. Advances in Neural Information Processing Systems, 32, 2019.

Rolf Pfeifer and Christian Scheier. Understanding intelligence. 2001.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015a.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.

Karl Sims. Evolving virtual creatures. In Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pp. 15–22, 1994.

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026–5033, 2012.

Brandon Trabucco, Mariano Phielipp, and Glen Berseth. Anymorph: Learning transferable polices by inferring agent morphology. In International Conference on Machine Learning, pp. 21677–21691. PMLR, 2022.

Patricia A Vargas, Ezequiel A Di Paolo, Inman Harvey, and Phil Husbands. The horizons of evolutionary robotics. 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. Encoding word order in complex embeddings. arXiv preprint arXiv:1912.12333, 2019.

Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured policy with graph neural networks. In International conference on learning representations, 2018a.

Tingwu Wang, Yuhao Zhou, Sanja Fidler, and Jimmy Ba. Neural graph evolution: Towards efficient automatic robot design. In International Conference on Learning Representations, 2018b.
Yuxing Wang, Shuang Wu, Tiantian Zhang, Yongzhe Chang, Haobo Fu, Qiang Fu, and Xueqian Wang. Preco: Enhancing generalization in co-design of modular soft robots via brain-body pre-training. In Conference on Robot Learning, pp. 478–498, 2023.

Julian Whitman, Matthew Travers, and Howie Choset. Learning modular robot control policies. IEEE Transactions on Robotics, 2023.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533, 2020.

Zheng Xiong, Jacob Beck, and Shimon Whiteson. Universal morphology control via contextual modulation. In International Conference on Machine Learning, pp. 38286–38300, 2023.

Jie Xu, Andrew Spielberg, Allan Zhao, Daniela Rus, and Wojciech Matusik. Multi-objective graph heuristic search for terrestrial robot design. In 2021 IEEE international conference on robotics and automation, pp. 9863–9869, 2021.

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. volume 33, pp. 5824–5836, 2020.

Ye Yuan, Yuda Song, Zhengyi Luo, Wen Sun, and Kris M Kitani. Transform2act: Learning a transform-and-control policy for efficient agent design. In International Conference on Learning Representations, 2021.

Allan Zhao, Jie Xu, Mina Konaković-Luković, Josephine Hughes, Andrew Spielberg, Daniela Rus, and Wojciech Matusik. Robogrammar: graph grammar for terrain-optimized robot design. ACM Transactions on Graphics, 39(6):1–16, 2020.

The supplementary material provides additional results, discussions, and implementation details. Our code is available in our supplementary material for reproduction and further study. Visit our website for videos and additional visualizations.
A.1 ENVIRONMENT AND TASK DETAILS

Figure 7: Randomly generated agents in six different environments for visualization (panels: Crawler, Terrain Crosser, Cheetah, Swimmer, Glider, Walker). Purple ground indicates agents in a 3D physical world, green ground represents agents in the xy-plane physical world, blue ground denotes agents in the xz-plane physical world, and brown ground denotes a physical world with variable terrain.

Figure 8: Visualization of four initial designs in the environments (panels: Initial Type-1, Initial Type-2, Initial Type-3, Initial Type-4). Type-1 consists of a structure with four limbs. Type-2 and Type-3 each include two limbs connected by a joint, located in the xy-plane and xz-plane respectively. Type-4 comprises three limbs connected by two joints. Note that BodyGen can support almost arbitrary initial designs and is not limited to the specified types.

This section provides additional descriptions of the environments and tasks used in our experiments. Figure 7 displays randomly generated agents in six different environments. The first four environments, CRAWLER, TERRAINCROSSER, CHEETAH, and SWIMMER, are derived from previous work (Yuan et al., 2021) to ensure a fair comparison. We have also introduced two additional environments, GLIDER and WALKER, to broaden the testing scope and provide a more comprehensive algorithm evaluation.

Each agent consists of multiple limbs connected by joints, each equipped with a motor for controlling movement. Sensors within the limbs monitor positional coordinates, velocity, and angular velocity. Each limb's attributes include limb length and limb size. Each joint's attributes cover rotation range and maximum motor torque. Each episode starts with a simple initial design, as demonstrated in Figure 8. The agent evolves to its final morphology through a series of topological and attribute modifications. Meanwhile, the control policy must be optimized concurrently.
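To make the agent description above concrete, the following is a minimal sketch of how a limb tree with limb and joint attributes might be represented. All class names, field names, and numeric values here are hypothetical choices for illustration and are not taken from the BodyGen code base.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Joint:
    rotation_range: float  # hypothetical field: allowed rotation range
    max_torque: float      # hypothetical field: maximum motor torque


@dataclass
class Limb:
    length: float
    size: float
    joint: Optional[Joint] = None  # joint to the parent; None for the root
    children: List["Limb"] = field(default_factory=list)


def add_child(parent: Limb, child: Limb, max_children: int) -> bool:
    """Topology edit: attach `child` unless the per-limb cap is reached."""
    if len(parent.children) >= max_children:
        return False
    parent.children.append(child)
    return True


def count_limbs(root: Limb) -> int:
    """Count the limbs in the tree rooted at `root`."""
    return 1 + sum(count_limbs(c) for c in root.children)


# Toy usage: three attachment attempts against a cap of two children.
root = Limb(length=0.3, size=0.05)
for _ in range(3):
    add_child(root, Limb(0.2, 0.04, Joint(60.0, 1.0)), max_children=2)
# An attribute edit changes parameters but leaves the topology untouched.
root.children[0].length = 0.25
```

In this sketch, topology modifications change the tree structure while attribute modifications only rewrite fields of existing nodes, mirroring the two kinds of design edits described above.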
Crawler. The agent inhabits a 3D environment with flat ground at $z = 0$. The initial design is Type-1 in Figure 8. Each limb can have up to two child limbs, except for the root limb. The height and 3D world velocity of the root limb are also included in the environment state. The reward function is defined as:

$$r_t = \frac{|p^x_{t+1} - p^x_t|}{\Delta t} - w \cdot \frac{1}{J} \sum_{u \in V_t} \|a^e_u\|^2 \tag{A.1}$$

where $w = 0.0001$ is the weighting factor for the control penalty term, $J$ is the total number of limbs, and $\Delta t = 0.04$.

Terrain Crosser. The agent evolves in a terrain-variable environment, where the terrain features varying height differences. The maximum height difference of the terrain is $z_{max} = 0.5$. The agent must navigate these gaps to move forward. The initial design is Type-3 in Figure 8. The terrain is generated from a single-channel image, with different values representing different height rates. Each limb of the agent can have up to three child limbs. For the root limb, its height, 2D world velocity, and a variable encoding the terrain information are included in the environment state. The reward function is defined as:

$$r_t = \frac{|p^x_{t+1} - p^x_t|}{\Delta t}, \tag{A.2}$$

where $\Delta t = 0.008$, and the episode is terminated when the root limb height is below 1.0.

Cheetah. The agent in this environment evolves with flat ground at $z = 0$. The initial design is Type-3 in Figure 8. Each limb of the agent can have up to three child limbs. The height and 2D world velocity of the root limb are added to the environment state. The reward function is defined in Equation (A.2). The episode is terminated when the root height is below 0.7.

Swimmer. The swimmer is designed to cover snake-like creatures in the water. The agent evolves in water with viscosity $vis = 0.1$ for water simulation. The initial design is Type-2 in Figure 8. Each limb supports up to three child joints. The root limb's 2D world velocity is incorporated into the environment state.
The reward function is the same as that of Terrain Crosser in Equation (A.2).

Glider. The agent in this environment evolves on flat ground. The initial design is Type-4, as shown in Figure 8. In Glider, the agent's search depth is limited to three times that of the initial design, encouraging full exploration of a relatively shallow search space. We also provide three different task levels: regular, medium, and hard, where each limb of the agent can have up to one, two, or three child limbs, respectively. The reward function is defined in Equation (A.2).

Walker. The agent evolves on flat ground. The starting design is Type-4 in Figure 8. The search depth for the agent is capped at four times that of the initial design to promote thorough exploration within a comparatively shallow search space. Similarly, three levels of task difficulty are offered, with the same meaning as described in Glider. The reward function is specified in Equation (A.2).

Note that the reward functions are kept simple and consistent in all environments. Unlike common practices in OpenAI Gym (Brockman et al., 2016), we do not provide any additional reward priors (e.g., alive bonus) to facilitate learning, which places higher demands on algorithm robustness.

A.2 MOTIVATIONS

This section details the motivations behind designing MoSAT and TopoPE for embodiment co-design, aiming to provide further insights.

A.2.1 CENTRALIZED MESSAGE PROCESSING AND MOSAT

As demonstrated in Figure 9, GNN-like neural systems are commonly found in simple organisms such as planarians, where sensory information is connected through neural networks for distributed and localized processing. In contrast, advanced creatures such as humans utilize a centralized signal processing approach, where signals from various body parts are centrally processed in the brain, leveraging scalability advantages similar to the self-attention mechanism within transformers.
Figure 10 further illustrates the different message delivery mechanisms between GNN and Transformer. GNN uses aggregation and broadcasting for message transmission, resulting in progressive information reduction. As demonstrated in Figure 10 (a), the dog-like robot needs to adjust its posture throughout its motion. The GNN's localized message processing approach requires signals from distant locations to propagate multiple times before reaching the target actuator. In contrast, Transformers can provide faster message transfer and interaction by employing self-attention to facilitate direct point-to-point and point-to-multipoint message delivery.

Figure 9: Comparative overview of natural and artificial neural processing systems. (a) Localized message processing in planarians and GNNs. (b) Centralized message processing in human brains and Transformers. Relevant images are sourced from Encyclopaedia Britannica (2024); Brain for AI Fandom (2024).

Figure 10: Comparison of different message delivery mechanisms between GNN and Transformer. (a) Message via Aggregation and Broadcasting. (b) Messaging via Self-Attention.

Inspired by this, we propose MoSAT in Section 4.1. MoSAT first maps sensor information to the latent space and leverages self-attention for signal interactions, enabling centralized decision-making. Meanwhile, in GNNs, the message propagation mechanism allows for an implicit representation of morphology. However, while transformers leverage self-attention for direct message delivery, they do not offer an asymmetric information propagation mechanism to differentiate positions between different body parts.
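The propagation argument above can be illustrated with a small sketch. It counts how many rounds of neighbor-to-neighbor message passing a signal needs to travel between two limbs of a tree-shaped morphology, whereas a single self-attention layer connects every pair of limbs directly. The parent-array encoding and the toy five-limb morphology are assumptions made for this illustration only.

```python
from collections import deque


def hops_between(parents, src, dst):
    """Tree distance between two limbs.

    parents[i] is the parent index of limb i, or -1 for the root.
    The result equals the number of local message-passing rounds
    needed for information from `src` to reach `dst`.
    """
    n = len(parents)
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child].append(parent)
            adj[parent].append(child)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("limbs are not connected")


# Toy morphology: torso 0, front leg 0 -> 1 -> 2, hind leg 0 -> 3 -> 4.
parents = [-1, 0, 1, 0, 3]
gnn_rounds = hops_between(parents, 2, 4)  # front foot to hind foot
attention_layers = 1  # one self-attention layer links all limb pairs
```

For this toy morphology the two feet are four hops apart, so a purely local scheme needs four rounds before the hind foot can react to the front foot, and the gap widens as limbs are added.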
A.2.2 MORPHOLOGY POSITION EMBEDDING AND TOPOPE

Position encoding has proven effective for location representation in natural language processing (Vaswani et al., 2017; Shaw et al., 2018; Raffel et al., 2020; Wang et al., 2019). Effectively representing the robot's morphology is crucial for co-designing morphology and control policies.

In our work, we propose the Topology Position Embedding (TopoPE) to encode the morphology in a way compatible with Transformer-based architectures. TopoPE assigns a unique embedding to each limb based on its topological position within the robot's morphology tree. Specifically, the embedding index for a limb is derived from the path from the root node to the limb, capturing the structural relationships within the morphology.

In previous works (Trabucco et al., 2022; Gupta et al., 2021a), morphology encodings often rely on traversal sequences such as depth-first search (DFS) or on manual naming conventions (Trabucco et al., 2022; Li et al., 2024) based on a full model of the robot. When limbs are removed to generate variants, the names of the remaining limbs remain unchanged, facilitating consistent encoding. However, in our setting, there is no predefined full model, and the robot's morphology is dynamically generated during co-design. Manually naming limbs is impractical in this context. Our TopoPE addresses this challenge using a topology indexing mechanism, which uses the path to the root as the embedding index. This method naturally extends to dynamically changing morphologies and ensures that similar substructures share similar embeddings, promoting generalization across different morphologies.
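The path-based indexing idea can be sketched as follows. This is only an illustrative reading, assuming that a limb's path is recorded as the tuple of child-slot positions from the root and that each distinct path is mapped to an integer embedding index on first use. The names `TopoIndexer` and `limb_paths` are hypothetical, and the actual index construction in BodyGen may differ.

```python
class TopoIndexer:
    """Maps root-to-limb paths to embedding indices (illustrative only)."""

    def __init__(self):
        self._index = {(): 0}  # the root limb has the empty path

    def index_of(self, path):
        key = tuple(path)
        if key not in self._index:
            self._index[key] = len(self._index)
        return self._index[key]


def limb_paths(children, node=0, path=()):
    """Yield (limb, path) pairs for a tree given as {limb: [child, ...]}."""
    yield node, path
    for slot, child in enumerate(children.get(node, [])):
        yield from limb_paths(children, child, path + (slot,))


indexer = TopoIndexer()
# Morphology A: root 0 with children 1 and 2; limb 1 has a child 3.
morph_a = {0: [1, 2], 1: [3]}
# Morphology B: the same design without limb 3.
morph_b = {0: [1, 2]}
idx_a = {limb: indexer.index_of(p) for limb, p in limb_paths(morph_a)}
idx_b = {limb: indexer.index_of(p) for limb, p in limb_paths(morph_b)}
```

Here limb 2 keeps the same index in both morphologies because its path from the root is unchanged, whereas a depth-first traversal order would shift its position from 3 to 2 once limb 3 is removed.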
Moreover, unlike learnable position embeddings that are specific to particular morphologies, our approach can be extended using non-learnable embeddings, such as sinusoidal embeddings (Vaswani et al., 2017), which offer better extrapolation to unseen morphologies and eliminate the need for training the embeddings.

To demonstrate the effectiveness of TopoPE, we conducted ablation studies comparing models with and without TopoPE. As shown in Table 1 and Figure 11, incorporating TopoPE significantly improves performance across various tasks. This indicates that TopoPE provides a more informative and stable encoding of the morphology, facilitating better learning of control policies.

In contrast to other morphology-aware positional encodings, our TopoPE is specifically designed to handle dynamic and diverse morphologies without relying on a fixed full model or manual limb naming. Additionally, our approach aligns well with the Transformer architecture, allowing standard attention mechanisms to capture interactions between different limbs based on their topological relationships.

A.3 IMPLEMENTATION DETAILS

A.3.1 TRAINING DETAILS

In line with standard reinforcement learning practices, we employed distributed trajectory sampling across multiple CPU threads to accelerate training. Each model is trained using four random seeds on a system equipped with 112 Intel Xeon Platinum 8280 cores and six Nvidia RTX 3090 GPUs. Our main code framework is based on Python 3.9.18 and PyTorch 2.0.1. For all the environments used in our work, it takes only approximately 30 hours to train a model with 20 CPU cores and a single NVIDIA RTX 3090 GPU on our server.

A.3.2 HYPERPARAMETERS

For BodyGen, we ran a grid search over MoSAT layer normalization {w/o-LN, Pre-LN, Post-LN}, policy network learning rate {5e-5, 1e-4, 3e-4}, value network learning rate {1e-4, 3e-4}, and MoSAT hidden dimension {32, 64, 128, 256}.
We did not search further over the environmental settings, optimizer configurations, PPO-related hyperparameters, or the training batch and minibatch sizes. Instead, we strictly maintained consistency with previous works (Wang et al., 2018b; Yuan et al., 2021; Kurin et al., 2020) to ensure a fair comparison. With further hyperparameter tuning, our algorithm could achieve higher performance levels. Table 3 displays the hyperparameters BodyGen adopted across all experiments.

For Transform2Act, we followed previous work (Yuan et al., 2021) and its officially released code repository (footnote 2), and used GraphConv as the GNN layer type, policy GNN size (64, 64, 64), policy learning rate 5e-5, value GNN size (64, 64, 64), value learning rate 3e-4, JSMLP activation function Tanh, JSMLP size (128, 128, 128) for the policy, and MLP size (512, 256) for the value function, which were the best values they picked using grid searches.

To make UMC-Message suitable for embodiment co-design, we equip it with a policy network and employ our temporal credit assignment via reinforcement learning. The network parameters and training settings are consistent with those used in BodyGen and Transform2Act to ensure a fair comparison. It adopted a GNN layer type of GraphConv, policy GNN size (64, 64, 64), policy MLP size (128, 128), policy learning rate 5e-5, value GNN size (64, 64, 64), value MLP size (512, 256), and value learning rate 3e-4. We followed previous work (Huang et al., 2020) and also referred to the publicly released code (footnote 3) for implementation.
For NGE, we follow previous works (Wang et al., 2018b; Yuan et al., 2021) according to the publicly released code (footnote 4), and used a number of generations 125, agent population size 20, elimination rate 0.15, GNN layer type GraphConv, MLP activation Tanh, policy GNN size (64, 64, 64), policy MLP size (128, 128), value GNN size (64, 64, 64), value MLP size (512, 256), policy learning rate 5e-5, and value learning rate 3e-4, which were the best searched values described by previous work.

Footnote 2: https://github.com/Khrylx/Transform2Act
Footnote 3: https://github.com/huangwl18/modular-rl
Footnote 4: https://github.com/WilsonWangTHU/neural_graph_evolution

Table 3: Hyperparameters of BodyGen adopted in all the experiments

| Hyperparameter | Value |
|---|---|
| Number of Topology Design $N^{topo}$ | 5 |
| Number of Attribute Design $N^{attr}$ | 1 |
| MoSAT Layer Normalization | Pre-LN |
| MoSAT Activation Function | SiLU |
| MoSAT FNN Scaling Ratio $r$ | 4 |
| MoSAT Block Number (Policy Network) | 3 |
| MoSAT Block Number (Value Network) | 3 |
| MoSAT Hidden Dimension (Policy Network) | 64 |
| MoSAT Hidden Dimension (Value Network) | 64 |
| Optimizer | Adam |
| Policy Learning Rate | 5e-5 |
| Value Learning Rate | 3e-4 |
| Clip Gradient Norm | 40.0 |
| PPO Clip $\epsilon$ | 0.2 |
| PPO Batch Size | 50000 |
| PPO Minibatch Size | 2048 |
| PPO Iterations Per Batch | 10 |
| Training Epochs | 1000 |
| Discount factor $\gamma$ | 0.995 |
| GAE $\lambda$ | 0.95 |

A.3.3 THE BATCH MODE FOR MOSAT

To maximize training efficiency, we further offer MoSAT the ability to process multiple morphologies in a batch mode. For a batch of state inputs $\{s_t\}_B$, we first pad them to equal length $[s_t]_B \in \mathbb{R}^{B \times L_m \times d}$, where $L_m$ is the max limb number of morphologies within this batch, and generate a padding matrix $P \in \mathbb{R}^{B \times L_m}$, where $P_{ij} = 1$ for $j \le L_i$ and $P_{ij} = 0$ for $j > L_i$, with $L_i$ the limb number of the $i$-th morphology. To keep the messaging logic exactly equivalent to the regular mode, we can eliminate the influence of padding by modifying the attention operation with an attention mask $\Theta \in \mathbb{R}^{B \times L_m \times L_m}$:

$$\mathrm{Attention}([m_t]_B) = \mathrm{SoftMax}\left(\frac{QK^T}{\sqrt{d_k}} + \Theta\right)V, \tag{A.3}$$

where $\Theta_{ijk} = \log(P_{ik} + \epsilon)$.
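The masking step in Equation (A.3) can be checked numerically with a short NumPy sketch. It assumes identity projections ($Q = K = V$ equal to the padded input) and a single head to stay compact, so it illustrates the mask construction only and is not the MoSAT layer itself. The function name and the default value of `eps` are hypothetical.

```python
import numpy as np


def masked_self_attention(x, lengths, eps=1e-9):
    """Single-head attention over a zero-padded batch.

    x: array of shape (B, Lm, d); lengths: true limb count per sample.
    Uses Q = K = V = x, which is an assumption made for brevity.
    """
    B, Lm, d = x.shape
    # P[i, j] = 1 for real limbs and 0 for padded slots.
    P = (np.arange(Lm)[None, :] < np.asarray(lengths)[:, None]).astype(x.dtype)
    # Theta[i, j, k] = log(P[i, k] + eps), broadcast over the query axis j.
    theta = np.log(P + eps)[:, None, :]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d) + theta
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights, P


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))
x[0, 2] = 0.0  # sample 0 has two real limbs; the third slot is padding
out, weights, P = masked_self_attention(x, lengths=[2, 3])
```

Because $\log(\epsilon)$ is a large negative number, padded keys receive almost zero attention weight, so the outputs for the real limbs of sample 0 match (up to a tolerance on the order of $\epsilon$) what unpadded attention over its two limbs would produce.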
Finally, we remove the batch padding and re-allocate actions to joints of different agents via $\{a\}_B = [a]_B \odot P$, where $\odot$ represents the bool-selection operation according to the padding matrix $P$.

A.4 ALGORITHM DETAILS

Algorithm 1 illustrates the overall training process of BodyGen, which is based on PPO for efficient reinforcement learning. We highlight three key components: the interaction process, our temporal credit assignment based on GAE, and the main loop for iterative optimization.

Algorithm 1: Synchronous Learning Algorithm for BodyGen

Input: Replay Buffer $\mathcal{B}$, Batch $B$, Optimizer optimizer
Initialize: Policy networks $\pi_\theta: \{\pi^{topo}_\theta, \pi^{attr}_\theta, \pi^{ctrl}_\theta\}$; Value networks $V_\theta: \{V^{topo}_\theta, V^{attr}_\theta, V^{ctrl}_\theta\}$; $\mathcal{B} \leftarrow \emptyset$, $B \leftarrow \emptyset$; Discount factor $\gamma$, GAE Exponential Weight $\lambda$

Function INTERACT(Policy: $\pi$, Replay Buffer: $\mathcal{B}$):
    while $\mathcal{B}$ not reaching max buffer size do
        $G_0 \leftarrow$ initial design
        $\Phi \leftarrow$ topo    ▷ Topology design stage
        for $t = 0, 1, ..., N^{topo} - 1$ do
            $a^{topo}_t \sim \pi^{topo}(a^{topo}_t | G_t)$    ▷ Sample topology actions from all limbs
            $G_{t+1} \leftarrow$ apply $a^{topo}_t$ to modify the topology $(V_t, E_t)$ of current design $G_t$
            $r_t = 0$; $S_t = \Phi$; store $\{r_t, \emptyset, a^{topo}_t, G_t, S_t, 0, 0\}$ into $\mathcal{B}$    ▷ Update Buffer $\mathcal{B}$ with transition
        $\Phi \leftarrow$ attr    ▷ Attribute design stage
        for $t = N^{topo}, ..., N^{topo} + N^{attr} - 1$ do
            $a^{attr}_t \sim \pi^{attr}(a^{attr}_t | G_t)$    ▷ Sample attribute actions from all limbs
            $G_{t+1} \leftarrow$ apply $a^{attr}_t$ to modify the attribute $(A^v_t, A^e_t)$ of current design $G_t$
            $r_t = 0$; $S_t = \Phi$; store $\{r_t, \emptyset, a^{attr}_t, G_t, S_t, 0, 0\}$ into $\mathcal{B}$    ▷ Update Buffer $\mathcal{B}$ with transition
        $\Phi \leftarrow$ ctrl    ▷ Control stage
        $s_t \leftarrow$ Env.Reset(0)    ▷ $s_t = \{s_{v,t}\}$ denotes the sensor states from all limbs
        for $t = N^{topo} + N^{attr}, ..., T - 1$ do
            $a^{ctrl}_t \sim \pi^{ctrl}(a^{ctrl}_t | s_t, G_{done})$
            $r_t, s_{t+1}, T_t, C_t \leftarrow$ Env.Step($a^{ctrl}_t$)    ▷ $T_t$, $C_t$ denote termination and truncation
            $S_t = \Phi$; store $\{r_t, s_t, a^{ctrl}_t, G_t, S_t, T_t, C_t\}$ into $\mathcal{B}$    ▷ Update Buffer $\mathcal{B}$ with transition

Function ENHANCEDGAE(Value Function: $V_\theta$, Replay Buffer: $\mathcal{B}$):
    for $t = T - 1, ..., 0$ do
        $U_t = r_t + U_{t+1} (1 - (T_t \lor C_t))$    ▷ Calculate return
        $\delta_t = r_t + \gamma V_\theta(s_{t+1}) (1 - T_t) - V_\theta(s_t)$    ▷ Calculate the TD-error term
        if $S_t$ = ctrl then
            $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1} (1 - (T_t \lor C_t))$    ▷ Calculate advantage for the control stage
        else
            $\hat{A}_t = U_t - V_\theta(s_t)$    ▷ Calculate advantage for the design stage
        $\hat{R}_t = V_\theta(s_t) + \hat{A}_t$    ▷ Calculate the target value
        store $\{\hat{A}_t, \hat{R}_t\}$ into $\mathcal{B}$    ▷ Append $\hat{A}_t$ and $\hat{R}_t$ to the corresponding transition item in $\mathcal{B}$

Function MAIN():
    while not reaching max iterations do
        $Th_i \leftarrow$ Thread(INTERACT, $\pi_\theta$, $\mathcal{B}$)    ▷ We use multiple CPU threads for sampling
        $Th_i$.join()    ▷ Gather trajectories collected from threads
        ENHANCEDGAE($V_\theta$, $\mathcal{B}$)    ▷ Perform temporal credit assignment for co-design
        while not reaching max epochs do
            Update $B \sim \mathcal{B}$    ▷ Sample a random batch $B$ from Buffer $\mathcal{B}$
            Calculate PPO loss $L_{ppo} = L_{policy} + L_{value}$    ▷ According to Equation (4.8) and (4.10)
            optimizer $\leftarrow$ Gradient from $L_{ppo}$    ▷ Gradient descent to update $\pi_\theta$ and $V_\theta$

A.5 ADDITIONAL RESULTS

A.5.1 QUANTITATIVE RESULTS

Table 4: Comparison of BodyGen, its ablation variants, and baseline methods.
| Methods | CRAWLER | TERRAINCROSSER | CHEETAH | SWIMMER | GLIDER-REGULAR |
|---|---|---|---|---|---|
| BodyGen (Ours) | 10381.96 ± 353.97 | 5056.01 ± 703.57 | 11611.52 ± 522.86 | 1305.17 ± 15.25 | 11082.29 ± 99.21 |
| - w/o MoSAT | 818.92 ± 57.78 | 407.30 ± 4.50 | 662.88 ± 74.88 | 476.26 ± 19.95 | 447.72 ± 7.56 |
| - w/o Enhanced-TCA | 4994.44 ± 160.14 | 2668.66 ± 844.22 | 8158.74 ± 55.71 | 786.32 ± 19.39 | 8317.88 ± 498.26 |
| Transform2Act | 4185.63 ± 334.04 | 2393.84 ± 692.96 | 8405.70 ± 815.64 | 732.20 ± 22.61 | 6901.68 ± 374.42 |
| NGE | 1545.13 ± 626.54 | 881.71 ± 459.96 | 2740.79 ± 515.51 | 395.90 ± 173.85 | 1567.84 ± 756.74 |
| UMC-Message | 6492.90 ± 441.04 | 1411.51 ± 705.68 | 5785.40 ± 2110.77 | 961.20 ± 183.03 | 7354.34 ± 2145.22 |

| Methods | GLIDER-MEDIUM | GLIDER-HARD | WALKER-REGULAR | WALKER-MEDIUM | WALKER-HARD |
|---|---|---|---|---|---|
| BodyGen (Ours) | 11996.82 ± 595.51 | 10798.06 ± 298.39 | 12062.49 ± 513.07 | 12962.08 ± 537.34 | 11982.07 ± 520.78 |
| - w/o MoSAT | 489.75 ± 5.74 | 533.17 ± 14.20 | 555.33 ± 18.15 | 708.32 ± 12.72 | 827.33 ± 47.71 |
| - w/o Enhanced-TCA | 7454.55 ± 289.93 | 7592.03 ± 1023.70 | 7286.30 ± 735.55 | 6069.51 ± 652.96 | 6126.73 ± 572.85 |
| Transform2Act | 5573.44 ± 519.22 | 6120.37 ± 1380.74 | 8685.47 ± 1008.88 | 6287.15 ± 426.99 | 4645.31 ± 294.81 |
| NGE | 1649.60 ± 763.55 | 2339.90 ± 487.22 | 1402.85 ± 595.54 | 2600.39 ± 481.74 | 1575.87 ± 508.11 |
| UMC-Message | 4726.44 ± 2406.35 | 425.49 ± 141.02 | 5417.14 ± 2019.43 | 5347.70 ± 2397.85 | 2783.09 ± 1587.06 |

As demonstrated in Figure 6, we present the full training curves for BodyGen with baselines including Transform2Act, UMC-Message, NGE, and ablation variants of ours w/o MoSAT and ours w/o Enhanced-TCA across ten co-design environments. Each model was trained using four random seeds. For all baselines, we employed the best performance configurations reported by previous works, as detailed in Section A.3. Table 4 further presents related metrics, with each cell showing the mean and standard deviation of episode rewards for the corresponding algorithm in each environment.

A.6 ADDITIONAL ABLATION STUDIES ON TOPOPE AND ENHANCED-TCA

We provide additional ablation studies on our proposed TopoPE and Enhanced-TCA to provide more insights, demonstrated in Figure 11 and Figure 12.
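As background for these ablations, the ENHANCEDGAE routine of Algorithm 1 can be sketched in plain Python. This is a minimal reading of the pseudocode: design-stage steps use the undiscounted return minus the value estimate, while control-stage steps use standard GAE. The sketch assumes the episode mask is one minus the logical OR of termination and truncation, and it takes the value of the next state as a precomputed list. The function and argument names are hypothetical.

```python
def enhanced_gae(rewards, values, next_values, stages,
                 terminated, truncated, gamma=0.995, lam=0.95):
    """Backward pass over one buffer of transitions.

    stages[t] is "topo", "attr", or "ctrl"; every other argument is a
    per-step list of the same length.
    """
    n = len(rewards)
    advantages = [0.0] * n
    targets = [0.0] * n
    next_return, next_adv = 0.0, 0.0
    for t in reversed(range(n)):
        # Assumed mask: an episode ends on termination or truncation.
        done = 1.0 if (terminated[t] or truncated[t]) else 0.0
        # Undiscounted return U_t.
        ret = rewards[t] + next_return * (1.0 - done)
        # TD error; bootstrapping stops only on true termination.
        not_term = 0.0 if terminated[t] else 1.0
        delta = rewards[t] + gamma * next_values[t] * not_term - values[t]
        if stages[t] == "ctrl":
            advantages[t] = delta + gamma * lam * next_adv * (1.0 - done)
        else:
            advantages[t] = ret - values[t]
        targets[t] = values[t] + advantages[t]
        next_return, next_adv = ret, advantages[t]
    return advantages, targets


# Toy episode: two design steps with zero reward, then two control steps.
adv, tgt = enhanced_gae(
    rewards=[0.0, 0.0, 1.0, 2.0],
    values=[0.0, 0.0, 0.0, 0.0],
    next_values=[0.0, 0.0, 0.0, 0.0],
    stages=["topo", "attr", "ctrl", "ctrl"],
    terminated=[False, False, False, True],
    truncated=[False, False, False, False],
)
```

In the toy episode both design steps receive an advantage equal to the full episode return of 3, even though their immediate reward is zero, which is the balancing effect that temporal credit assignment is meant to provide.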
Figure 11: Extensive experiments on our proposed simple-yet-effective Topology Position Encoding (TopoPE) across different architectures of MoSAT and GNN, validating TopoPE as an efficient and general method for morphology representation. (1) MoSAT: w/o TopoPE vs. with TopoPE; (2) GNN: w/o TopoPE vs. with TopoPE. Both sets demonstrate the clear performance improvements brought by TopoPE. Legend: GNN + TopoPE + Enhanced-TCA; MoSAT + TopoPE + Enhanced-TCA (Ours); GNN + Enhanced-TCA; MoSAT + Enhanced-TCA.

Figure 12: Extensive experiments on our proposed Temporal Credit Assignment Mechanism (Enhanced-TCA) across different architectures of MoSAT and GNN, validating the Enhanced-TCA mechanism as an efficient method for enhancing bi-level optimization. (1) MoSAT: w/o Enhanced-TCA vs. with Enhanced-TCA; (2) GNN: w/o Enhanced-TCA vs. with Enhanced-TCA. Both sets demonstrate the clear performance improvements brought by our Enhanced-TCA mechanism. Legend: GNN + TopoPE + Enhanced-TCA; MoSAT + TopoPE + Enhanced-TCA (Ours); GNN + TopoPE; MoSAT + TopoPE.

A.7 COMPARISON OF BODYGEN'S DESIGN SPACE WITH UNIMAL

To better position BodyGen, we also compare its design space and computational requirements to those of UNIMAL (Gupta et al., 2021b), a widely recognized framework for morphology design. BodyGen and UNIMAL share similarities and differences in their approaches to morphology design, search space, and computational demands, providing insights into the trade-offs between these systems. We compare them from several perspectives:

Initial design. The search space of UNIMAL is similar to the design space in our "crawler" environment. Both BodyGen (in the crawler environment) and UNIMAL adopt an ant-like structure with a single body and four limbs extending in perpendicular directions as the initial design $G_0$.

Morphology actions.
UNIMAL offers three basic mutation operations: adding limbs, deleting limbs, and modifying limb parameters. BodyGen employs a similar set of actions but organizes them into two types: the topology design type, which includes adding limbs, deleting limbs, and passing, and the attribute design type, which focuses on modifying limb parameters. While these actions are conceptually aligned, BodyGen provides a more structured framework for exploring morphology changes.

Search space. UNIMAL allows for a maximum of 10 limbs, whereas the "crawler" environment used by BodyGen supports up to 29 limbs, offering a significantly larger space for morphological exploration. This difference highlights BodyGen's broader scope in accommodating complex designs.

A.8 MORE VISUALIZATION RESULTS

In this section, we provide additional visualization results for embodied agents generated by BodyGen across ten co-design environments.

Figure 13: Visualization of embodied agents generated by BodyGen in different environments (panels: Crawler, Terrain Crosser, Cheetah, Swimmer, Glider-Regular, Glider-Medium, Glider-Hard, Walker-Regular, Walker-Medium, Walker-Hard).

Figure 14: Visualization of BodyGen's attention map during the control process on Cheetah.