# Action-Conditioned Generation of Bimanual Object Manipulation Sequences

Haziq Razali, Yiannis Demiris*
Personal Robotics Lab, Dept. of Electrical and Electronic Engineering, Imperial College London
{h.bin-razali20,y.demiris}@imperial.ac.uk

*Yiannis Demiris is supported by a Royal Academy of Engineering Chair in Emerging Technologies. Code at www.imperial.ac.uk/personal-robotics/software

The generation of bimanual object manipulation sequences given a semantic action label has broad applications in collaborative robots and augmented reality. This relatively new problem differs from existing works that generate whole-body motions without any object interaction, as it additionally requires the model to learn the spatio-temporal relationship between the human joints and the object motion given that label. To tackle this task, we leverage the varying degree to which each muscle or joint is involved during object manipulation. For instance, the wrists act as the prime movers for the objects while the finger joints are angled to provide a firm grip. The remaining body joints are the least involved in that they are positioned as naturally and comfortably as possible. We thus design an architecture comprising 3 main components: (i) a graph recurrent network that generates the wrist and object motion, (ii) an attention-based recurrent network that estimates the required finger joint angles given the graph configuration, and (iii) a recurrent network that reconstructs the body pose given the locations of the wrists. We evaluate our approach on the KIT Motion Capture and KIT RGBD Bimanual Manipulation datasets and show improvements over a simplified approach that treats the entire body as a single entity, as well as over existing whole-body-only methods.

## Introduction

Modelling human and object motion given a semantic action label has broad applications in human-robot interaction (HRI) (Chao et al. 2015) and virtual and augmented reality (AR/VR) (Chacón-Quesada and Demiris 2022). In the context of HRI, being able to forecast the hand and object trajectories would allow the robot to respond in a timely manner while avoiding collisions. For AR/VR, predicting the motion gives the system additional buffer time to plan its rendering. Existing works, however, have focused on modelling only the human motion for whole-body actions such as running, jumping, and walking (Guo et al. 2020; Petrovich, Black, and Varol 2021). The same set of methods cannot be deployed in a setting that involves human-object interaction, as there exists a spatio-temporal correlation between the human and object motion, e.g., how a person orients a bottle relative to a cup during a pour action. Our goal in this work is thus to address the relatively new task of action-conditioned generation of bimanual object manipulation, where we take a semantic action label such as Pour and generate a realistic sequence of both the human and object motion, from the moment the person approaches the objects and performs the pouring action, till the moment the objects are placed back on the table after completion.

Figure 1: Given the action label $a$ and the initial human and object pose, our network generates the entire manipulation sequence using 3 separate modules.
A first solution to this task would be to use a densely connected graph whose nodes represent the human and object poses and whose edges represent their relations. However, one intuitive observation in the context of object manipulation is the varying degree to which each joint is involved in the action. We therefore propose a network partitioned into three modules, each dedicated to the respective body parts or objects: (1) a graph recurrent network that models the wrist and object motion, (2) an attention-based recurrent network for the finger joints, and (3) another recurrent network for the body joints (Fig. 1). We show that this provides better performance than representing the entire body pose as a single node. Next, existing works use a recurrent network built for action recognition to quantify the performance of their generative models. However, the absence of any human-object relation in such a recognition network makes it unsuitable for our case. To address this, we propose a bidirectional graph recurrent network for bimanual action segmentation.

In summary, our contributions are as follows: (1) We introduce a novel neural network for the action-conditioned generation of bimanual object manipulation sequences. To the best of our knowledge, ours is the first to tackle the problem of action-conditioned pose generation in the context of object manipulation. (2) We propose a bidirectional graph recurrent network to evaluate the performance of generative models tasked with generating bimanual actions.

## Related Works

**Human Pose Forecasting.** Recent works on 3D human pose forecasting differ mainly in their architecture, opting either for a deterministic model (Martinez, Black, and Romero 2017; Guo and Choi 2019; Corona et al. 2020) or for injecting stochasticity (Liu et al. 2021; Yuan and Kitani 2020; Kundu, Gor, and Babu 2019) via Variational Autoencoders (VAE) (Kingma and Welling 2013) or Generative Adversarial Networks (Goodfellow et al. 2014) in order to predict multiple plausible futures. These works receive as input a sequence of past poses and output either the joint positions (Martinez, Black, and Romero 2017; Li et al. 2018) or joint rotations (Fragkiadaki et al. 2015; Pavllo, Grangier, and Auli 2018) that are then converted to positions via forward kinematics. More recent works go beyond body joints to instead output the full SMPL body model (Loper et al. 2015; Taheri et al. 2022). Early deterministic models tend to use recurrent networks such as Gated Recurrent Units (GRU) (Chung et al. 2014) or fully convolutional layers. Several other works incorporate additional context such as eye gaze (Razali and Demiris 2021) or the object coordinates (Razali and Demiris 2022; Taheri et al. 2022). Lastly, the context-aware model (Corona et al. 2020) forecasts both the human pose and object motion and is related to our work, although there are several notable differences. First, their model relies on the past 1 second to predict the next 2 seconds and does not take an input action label, meaning the predicted sequence relies purely on the past motion without any controllability. Second, their model predicts the full pose at every timestep. By contrast, our method receives only the input action label and initial positions, first generating the wrist and object motion before reconstructing the full body pose including the finger joints, from start to finish, and is thus more akin to synthesis.
**Human Pose Synthesis.** In contrast to human pose forecasting, methods developed for human pose synthesis typically do not receive a sequence of past poses as input. Rather, the input may be either a zero vector or the default standing pose. These models are then trained to generate the complete motion conditioned either on an audio signal (Li et al. 2021), a semantic action label (Guo et al. 2020; Petrovich, Black, and Varol 2021), or a sentence (Ahuja and Morency 2019). Autoregressive methods (Guo et al. 2020) hold an advantage in that they can be easily repurposed for forecasting, unlike purely generative ones that accept only the input action label without any pose (Petrovich, Black, and Varol 2021; Ahuja and Morency 2019). Most similar to our work are Action2Motion (Guo et al. 2020) and Actor (Petrovich, Black, and Varol 2021). Action2Motion takes an action label and generates the human pose in an autoregressive manner using a VAE-GRU, whereas Actor employs a VAE-Transformer (Vaswani et al. 2017) to generate the full sequence in one shot. A similarity shared by the abovementioned works is that they were built for whole-body motions such as running, walking, and jumping without any object interaction; handling such interaction is the contribution of this paper.

## Method

Given the one-hot action label $a$, the initial object poses $X^0 = [x^0_1, x^0_2, \dots, x^0_N] \in \mathbb{R}^{N \times M \times 3}$ and their labels $L = [l_1, l_2, \dots, l_N]$ at time $t = 0$ for $N$ objects, each represented by $M$ points of its bounding box or motion capture markers, and the human pose $P^0 \in \mathbb{R}^{K \times 3}$ with $K$ joints, our goal is to generate the complete human and object motion from the moment the person begins reaching for the objects to perform the action, till the moment the objects are placed back on the table after completing said action. In short, we want to learn $p(P^{1:T}, X^{1:T} \mid a, P^0, X^0, L)$.

However, not every joint of the human body is directly involved during bimanual object manipulation. The forearms, or more specifically the wrists, act as the primary movers in reaching for or moving the objects throughout the action. Their movements may mirror each other, e.g., when rolling dough with both hands, or differ, with one hand providing stability while the other makes precise movements, e.g., when stirring the contents of a cup. The finger joints are then angled to ensure that the object is firmly held and oriented as required for the task, such as the cutting of fruits. The remaining body joints are lastly positioned as naturally and comfortably as possible to perform the action. There is thus a higher degree of correlation between the objects and the forearms. In light of this, we can partition the complete human pose $P$ into the left and right wrists $x_l, x_r$, the fingers $F = [f_l, f_r]$, and the remaining body joints $b$. The wrists share the variable $x$ and are subsumed into $X$ as they will be treated as objects from here onwards. We can then further factorize our initial objective into three components:

$$\log p(X^{1:T} \mid a, X^0, L) + \log p(F^{1:T} \mid X^{1:T}, L) + \log p(b^{1:T} \mid x^{1:T}_l, x^{1:T}_r) \tag{1}$$

The result is more reflective of the degree of interaction that occurs during bimanual manipulation in that it first generates the wrist and object pose sequence given the action label, $p(X^{1:T} \mid a, X^0, L)$, before computing the required finger joint angles, $p(F^{1:T} \mid X^{1:T}, L)$, at every timestep. The remaining body joints are then reconstructed independently of the object pose and finger joint angles, $p(b^{1:T} \mid x^{1:T}_l, x^{1:T}_r)$. Note that we assume the objects to already be within grasping distance.
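To make the factorization in Eq. (1) concrete, the sketch below chains three modules in the order the terms suggest. The module names, interfaces, and stopping window are illustrative assumptions for this sketch, not the authors' code.

```python
import torch

def generate_sequence(action, obj_pose0, obj_labels, body_pose0,
                      wrist_obj_module, finger_module, body_module,
                      max_steps=300, var_threshold=1e-4):
    """Illustrative driver for the factorization in Eq. (1).

    wrist_obj_module : autoregressive model of p(X^{1:T} | a, X^0, L)
    finger_module    : model of p(F^{1:T} | X^{1:T}, L)
    body_module      : model of p(b^{1:T} | x_l^{1:T}, x_r^{1:T})
    All interfaces here are assumptions for illustration.
    """
    X, hidden = [obj_pose0], None
    for t in range(max_steps):
        x_next, hidden = wrist_obj_module(X[-1], obj_labels, action, hidden)
        X.append(x_next)
        # Stop once the wrists and objects come to rest (low variance over a
        # short window), the termination criterion described in the paper.
        if t > 10 and torch.var(torch.stack(X[-10:]), dim=0).max() < var_threshold:
            break
    X = torch.stack(X)                   # (T+1, N, M, 3) by assumption: wrists + objects over time
    F = finger_module(X, obj_labels)     # finger joint angles per timestep
    wrists = X[:, :2]                    # assume the first two nodes are the wrists
    b = body_module(wrists, body_pose0)  # remaining body joints per timestep
    return X, F, b
```

The ordering matters: the finger angles are computed from the generated wrist and object trajectory, while the body is reconstructed from the wrist trajectory alone.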
Figure 2 illustrates the framework of our system. In the following, we describe our method for all three modules.

Figure 2: Overview of our method. The wrist and object motion are generated by the graph network. The outputs are then sent to the finger and body pose modules to generate the respective body parts. We detach the data flowing to the finger and body pose modules to simplify the training. The architecture can thus be viewed as 3 modular components that can be trained separately.

### Wrist and Object Pose Module

We formulate the task of wrist and object pose generation as a sequential modelling problem over a densely connected graph, where the wrists and objects are each represented by a vertex and their locations are recursively predicted over time. Specifically, we first define a graph $G(V, E)$ at time $t = 0$ that connects each vertex to all its neighbours, where each vertex $v^t_i = [x^t_i, l_i]$ concatenates the coordinates and label, and each edge $e^t_{ij} = v^t_i - v^t_j$ is the difference between neighbours $i$ and $j$. We then compute the representation for each vertex by running the graph through the edge-convolution (Wang et al. 2019) variant of the message passing scheme:

$$g^t_i = \max_{j \in \mathcal{N}(i)} \phi_1([v^t_i, e^t_{ij}]) = \max_{j \in \mathcal{N}(i)} \phi_1\big(\big[\phi_2([x^t_i, l_i]),\; \phi_2([x^t_j, l_j]) - \phi_2([x^t_i, l_i])\big]\big) \tag{2}$$

where $\phi$ denotes a linear layer and $[\cdot, \cdot]$ a concatenation. This operation encodes the relation between vertex $i$ and all its neighbours through the edges while maintaining information about what and where the node is through the vertex features. We then concatenate the one-hot action label $a$ to the vertex feature $g^t_i$ and provide them as input to a GRU and subsequently a linear layer to produce the parameters of a Gaussian distribution:

$$h^t_i = \mathrm{GRU}(h^{t-1}_i, [g^t_i, a]) \tag{3}$$
$$\mu^t_i, \sigma^t_i = \phi_3(h^t_i) \tag{4}$$

Intuitively, the GRU is tasked with generating the latent motion distribution for the object given its feature-wise proximity to all its neighbours and the action label. Lastly, we sample from the Gaussian distribution, concatenate the output to the object label, and send the result to another linear layer to decode the coordinates of the object or wrist:

$$z^t_i \sim \mathcal{N}(\mu^t_i, \sigma^t_i) \tag{5}$$
$$x^{t+1}_i = x^t_i + \phi_4([z^t_i, l_i]) \tag{6}$$

Note that the object label $l_i$ not only helps the vertices establish their relations to each other in Eq. (2), but is also useful in determining the coordinates of the bounding box corners or motion capture markers in Eq. (6), as it determines their size and positions relative to each other. The entire network is then run recursively until the variance of the wrist and object coordinates dips below a threshold, indicating that the action has been completed and the person has reverted to the default standing pose.

### Finger Pose Module

As mentioned, the finger joint angles are a function of the objects of interest, the proximity of the wrists to said objects, and the action. Although information pertaining to proximity is encoded in the edge features, it needs to be aggregated appropriately. To this end, we first attend to the most likely object given the action label:

$$P = \sigma\!\left(\frac{\phi_5(L)\,\phi_6(a)^\top}{\sqrt{d}}\right) \tag{7}$$

where $d$ represents the dimension of the embedded variables and $P$ the vector of object probabilities. We then use the probability scores to scale the edge features returned by the graph network and concatenate the result to the finger pose from the previous timestep for prediction:

$$h^t_i = \mathrm{GRU}\Big(h^{t-1}_i, \Big[\textstyle\sum p_i\, e^t_{ij},\; f^t\Big]\Big) \tag{8}$$
$$f^{t+1} = \phi_7(h^t_i) \tag{9}$$

where a GRU is used to enforce temporal smoothness.
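Returning to the wrist and object pose module, below is a minimal PyTorch sketch of one autoregressive step of Eqs. (2)-(6). Layer sizes, the log-variance parameterization, and the masking of self-edges are assumptions made for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class WristObjectStep(nn.Module):
    """One autoregressive step of the wrist/object graph module (Eqs. 2-6).
    Hidden sizes and the number of points per node are illustrative choices."""

    def __init__(self, n_labels, n_actions, pts=8, h_dim=128, z_dim=32):
        super().__init__()
        self.phi2 = nn.Linear(pts * 3 + n_labels, h_dim)    # vertex embedding
        self.phi1 = nn.Linear(2 * h_dim, h_dim)             # edge-conv message (Eq. 2)
        self.gru = nn.GRUCell(h_dim + n_actions, h_dim)     # Eq. 3
        self.phi3 = nn.Linear(h_dim, 2 * z_dim)             # Eq. 4: mu and log-variance
        self.phi4 = nn.Linear(z_dim + n_labels, pts * 3)    # Eq. 6: coordinate offset

    def forward(self, x, labels, action, h=None):
        # x: (N, pts*3) node coordinates, labels: (N, n_labels) one-hot object labels,
        # action: (n_actions,) one-hot action label, h: (N, h_dim) previous GRU state.
        N = x.size(0)
        v = self.phi2(torch.cat([x, labels], dim=-1))        # embedded vertices
        vi = v.unsqueeze(1).expand(N, N, -1)                 # v_i along rows
        vj = v.unsqueeze(0).expand(N, N, -1)                 # v_j along columns
        msg = self.phi1(torch.cat([vi, vj - vi], dim=-1))    # one message per edge
        mask = torch.eye(N, dtype=torch.bool, device=x.device).unsqueeze(-1)
        g = msg.masked_fill(mask, float('-inf')).max(dim=1).values  # Eq. 2: max over neighbours
        a = action.unsqueeze(0).expand(N, -1)
        h = self.gru(torch.cat([g, a], dim=-1), h)           # Eq. 3
        mu, logvar = self.phi3(h).chunk(2, dim=-1)           # Eq. 4 (log-variance for stability)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp() # Eq. 5, reparameterized sample
        x_next = x + self.phi4(torch.cat([z, labels], dim=-1))  # Eq. 6: residual coordinate update
        return x_next, h
```

In the full model, the same edge features are reused by the finger pose module after being scaled by the attention weights of Eq. (7), while the predicted wrist coordinates are what the body pose module conditions on.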
Since the handedness of an individual affects the object selected by the left and right hands, we use a separate set of weights for the operations above for each hand, inducing a bias towards the object frequently selected by the respective hand, as shown in Fig. 2. Our finger pose module is deterministic, as there is very little to no in-hand manipulation of the objects; it can, however, be easily augmented with a VAE should stochasticity be required.

### Body Pose Module

Existing work has shown that head movements significantly lead motor actions only if the objects are neither situated in front of the person within the field of view nor within grasping distance (Land 2006). Because we assume the converse, we find it sufficient to simply concatenate the locations of the left and right wrists with the body pose at the previous timestep as input to predict the body pose at the next timestep:

$$h^t = \mathrm{GRU}(h^{t-1}, [x^t_l, x^t_r, b^{t-1}]) \tag{10}$$
$$b^t = \phi_8(h^t) \tag{11}$$

Likewise, we find the GRU crucial in maintaining temporal smoothness.

### Loss Function

Altogether, our architecture is trained end-to-end to minimize, at every timestep, a reconstruction term $\lambda_1 \lVert \hat{X}^t - X^t \rVert_2^2$ on the generated coordinates together with a KL-divergence term, weighted by $\lambda_2$, that regularizes the approximate posterior $q(Z^t \mid X^{\le t}, Z^{<t})$ of the latent variables in the wrist and object module.
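As a rough illustration of this objective: the sketch below pairs the coordinate reconstruction loss with a KL regularizer on the Gaussian of Eq. (4), using the mu/log-variance parameterization from the earlier module sketch and a standard-normal prior. Both the prior and the loss weights are assumptions for illustration; the exact form of the KL term is not fully recoverable from this copy of the text.

```python
import torch

def manipulation_loss(x_pred, x_true, mu, logvar, lambda1=1.0, lambda2=1e-3):
    """Per-timestep loss: L2 reconstruction plus a KL term (weights are illustrative).

    x_pred, x_true : (N, pts*3) predicted / ground-truth wrist and object coordinates
    mu, logvar     : (N, z_dim) Gaussian parameters from Eq. (4)
    The KL is taken against a standard normal prior N(0, I) -- an assumption,
    as the prior used in the paper is not stated in this copy of the text.
    """
    recon = ((x_pred - x_true) ** 2).sum(dim=-1).mean()
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1.0 - logvar).sum(dim=-1).mean()
    return lambda1 * recon + lambda2 * kl
```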