
The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Bi HMP-GAN: Bidirectional 3D Human Motion Prediction GAN

Jogendra Nath Kundu, Maharshi Gor, R. Venkatesh Babu
Video Analytics Lab, Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
jogendrak@iisc.ac.in, maharshigor18@gmail.com, venky@iisc.ac.in

Human motion prediction models have applications in various fields of computer vision. When they ignore the inherent stochasticity of future pose dynamics, such methods often converge to a deterministic, undesired mean of multiple probable outcomes. To address this, we propose a novel probabilistic generative approach called Bidirectional Human motion prediction GAN, or Bi HMP-GAN. To generate multiple probable human-pose sequences conditioned on a given starting sequence, we introduce a random extrinsic factor r, drawn from a predefined prior distribution. Furthermore, to enforce a direct content loss on the predicted motion sequence and to avoid mode-collapse, a novel bidirectional framework is incorporated by modifying the usual discriminator architecture. The discriminator is also trained to regress this extrinsic factor r, which is used along with the intrinsic factor (the encoded starting pose sequence) to generate a particular pose sequence. To further regularize the training, we introduce a novel recursive prediction strategy. Despite the probabilistic framework, the enhanced discriminator architecture allows predictions of an intermediate part of the pose sequence to be used as conditioning for prediction of the latter part of the sequence. The bidirectional setup also provides a new direction to evaluate the prediction quality against a given test sequence. For a fair assessment of Bi HMP-GAN, we report performance of the generated motion sequences using (i) a critic model trained to discriminate between real and fake motion sequences, and (ii) an action classifier trained on real human motion dynamics. Outcomes of both qualitative and quantitative evaluations on the probabilistic generations of the model demonstrate the superiority of Bi HMP-GAN over previously available methods.

Introduction

Seamless interaction of robots or AI systems with urban environments dominated by human beings requires certain behaviour prediction abilities. In this work, the focus is on understanding the dynamics of human pose. For example, the ability to predict pedestrian behaviour in an urban road scene is crucial for autonomous driving systems to prevent potential accidents. Other examples include interaction of robots with humans, such as handshaking, or catching or holding objects thrown by another person. Moreover, artificial systems must develop the ability to understand the general trends of human pose dynamics for effective and coherent interactions (Koppula and Saxena 2013). Humans develop such ability by observing actions or pose dynamics of other persons over time. Creating a system which models such diverse human actions is the prime motive towards achieving an efficient human motion prediction model (Mainprice and Berenson 2013).

The goal is to develop a model which can predict a plausible 3D human pose sequence from the given past dynamics of a certain time period. However, prediction of the future pose sequence should not be modeled deterministically, as there can be multiple plausible limb variations conditioned on the past motion dynamics. Since the uncertainty in the probable future pose increases with time, a deterministic model cannot be considered reliable for long-term predictions. For example, a person running may slow down to a stop or keep running at a different speed. Although such variations are present in the available human motion datasets, some pose dynamics are more probable than others. Hence, the model should have the flexibility to model such stochasticity in the prediction of the future pose sequence.

Figure 1: Mean average prediction error on Human 3.6M for different motion prediction methods. The blue and green bands, for Bi HMP-GAN and HP-GAN respectively, show the uncertainty in the prediction of future motion, in contrast to the other, deterministic approaches.

With the advent of deep learning for sequence-to-sequence modeling (Sutskever, Vinyals, and Le 2014), many recent works use variants of deep recurrent neural networks for human motion prediction and synthesis (Ghosh et al. 2017; Li et al. 2017). According to the analysis performed by Martinez, Black, and Romero (2017), earlier motion prediction methods (Taylor, Hinton, and Roweis 2007) show a catastrophic drift in the prediction of the immediate future frame conditioned on the past motion sequence. They proposed to solve it by utilizing the recurrent network to predict the residue on the past frame instead of directly estimating the next frame parameters. However, most of the recent works in human motion prediction (Li et al. 2018; Butepage et al. 2017) do not model the inherent stochasticity in the forecasted pose sequence. In such a scenario, the model predicts a deterministic, undesired mean of multiple probable pose dynamics, which often leads to suboptimal performance. We address this issue by introducing a randomly sampled vector (an extrinsic factor) along with the latent representation of the encoded past frames (the intrinsic representation). We consider the combination of these two factors as the input to a generative decoder architecture. This makes our framework a truly probabilistic generative approach for human motion prediction.

Recently, HP-GAN (Barsoum, Kender, and Liu 2017) proposed a similar approach by utilizing advances in generative adversarial networks (GAN) to model human motion prediction as a generative modeling task. But the authors have not evaluated its performance against the available deterministic state-of-the-art methods. The focus should be on the performance metric of long-term prediction to rule out the phenomenon of convergence to a mean pose sequence, which is evident in deterministic motion prediction methods (Li et al. 2018). However, the generative setup incorporated by HP-GAN does not have the flexibility for quality assessment of the generated motion.

Equal contribution; listed alphabetically by first names. Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The challenge is to incorporate modifications in the probabilistic motion prediction model which can offer a new direction to evaluate the expressiveness of such frameworks for long-term prediction. It has been shown that the quality of predictions by a pure encoder-decoder setup is much better than that of a variational counterpart, mostly because of the more complex objective of the latter: to generate novel samples (or to learn a continuous latent space). Hence, there has been increasing interest in incorporating a direct content loss (mean squared loss) on the available training samples even for generative modeling, as it ensures superior prediction quality while avoiding mode-collapse. Works like Chen et al. (2016) and Makhzani et al. (2015) incorporated an autoencoder setup with a generative adversarial objective to improve the quality of generation with a stabilized training regime. Motivated by this line of thought, and unlike HP-GAN, the goal is to integrate a direct content loss on the available full motion sequence (combined past and future frames) in the proposed conditional sequence generative framework. For each available full sequence, the proposed model should be able to predict the exact future sequence conditioned on the encoded past dynamics and some extrinsic latent representation. Moreover, as a given test sequence includes one of the plausible pose forecast dynamics, the latent random vector, along with modeling uncertainty in future prediction, must also be able to represent the exact pose forecast dynamics.

Unlike HP-GAN, the proposed generative framework incorporates a novel conditional discriminator architecture. Here the discriminator not only acts as a critic, discriminating actual pose dynamics from the predicted ones, but also regresses the randomly sampled extrinsic vector which was used for the prediction of the corresponding future dynamics. The design of such a discriminator has two prominent traits. Firstly, it avoids the inherent problem of mode-collapse, as it attempts to learn a one-to-one invertible mapping between the extrinsic latent vector and the corresponding motion prediction. Secondly, it offers a new way to enforce a direct content loss (similar to a deterministic encoder-decoder framework) on the prediction of the probabilistic decoder output (more details in the Approach section). Thus, by integrating this novel modification to the discriminator architecture with an efficient learning algorithm (see Algorithm 1), we are able to achieve superior motion prediction results as compared to previous methods. Such a setup also provides the flexibility to compare the quality of long-term prediction against previous deterministic state-of-the-art approaches.

Related Works

Data-driven human motion prediction models have been explored by researchers for quite a long time in both the computer animation and machine learning communities. Before the deep learning era, various probabilistic graphical models were tried to efficiently model human motion dynamics. Researchers have used time-series learning methods like the Hidden Markov Model (Arikan, Forsyth, and O'Brien 2003), restricted Boltzmann machines (Taylor, Hinton, and Roweis 2007), Gaussian processes (Wang, Fleet, and Hertzmann 2008), and switching linear dynamical systems (Pavlovic, Rehg, and MacCormick 2001) to model human pose sequence data. However, these methods fail to model the high-dimensional, complex human pose sequence information effectively. Because of the highly nonlinear dependencies arising from the uncertainty in human movement, individually modeling the various factors affecting motion prediction does not scale well. These methods also suffer from a complex training regime (Taylor, Hinton, and Roweis 2007) with a complicated inference pipeline as a result of the sampling technique they employ.

On the other hand, the success of recurrent neural networks (RNN) for modeling time-series data motivated researchers to apply such architectures to the human motion prediction task. A multitude of recent works (Fragkiadaki et al. 2015; Martinez, Black, and Romero 2017) successfully used variants of the recurrent sequence-to-sequence architecture to model complex human skeleton dynamics. Such methods consider a seed motion sequence of a certain number of time-steps to condition the prediction of future pose dynamics by employing an encoder-decoder recurrent pipeline. Ghosh et al. (2017) employ an additional non-recurrent encoder and decoder to explicitly leverage spatial structure and dependencies between joint locations to improve the prediction quality of the human pose sequence. Jain et al. (2016) proposed Structural-RNN to exploit the underlying spatio-temporal graph for modeling human skeleton dynamics. However, none of these methods considers the stochasticity in future pose dynamics, since they model the task as a deterministic prediction problem. Hence, the expressiveness of these approaches in modeling long-term motion sequences deteriorates as a result of convergence to a mean pose sequence.

HP-GAN (Barsoum, Kender, and Liu 2017) first attempted to model human motion prediction as a probabilistic generative approach. They leverage recent advances in generative adversarial networks (GAN) (Goodfellow et al. 2014) to adversarially train a recurrent motion prediction framework. However, they do not assess the expressiveness of such a generative approach against deterministic counterparts. In contrast, the proposed Bi HMP-GAN incorporates novel modifications in architecture and training regime to improve the expressiveness of the probabilistic method against available deterministic approaches.

Approach

We here describe the details of the proposed probabilistic human motion prediction framework. The sequence prediction model takes a stream of input pose frames, which is considered as the past motion conditioning. Let $X_{1:T} = [x_1, x_2, \ldots, x_T]$ be the sequence of input 3D pose representations for time-steps $t = 1$ to $T$. Here, a single pose frame is represented by a set of joint angle parameters in the kinematic representation form. Similarly, the output motion sequence is represented by $X_{T+1:T'}$, where $(T' - T)$ is the length of the predicted sequence. The objective is to learn $P(X_{T+1:T'} \mid X_{1:T})$, i.e. the model should predict future pose dynamics conditioned on a given past motion sequence.

The prime complexity in the generation of a human motion sequence is twofold. Firstly, the generative model should predict a plausible human pose representation at each time-step. Understanding the joint angle limits while generating a 3D human pose can be considered the most important trait for avoiding prediction of implausible joint angles. Secondly, the sequence of pose dynamics should be coherent, so as to resemble real human motion dynamics. Previous methods do not address these complexities individually: a single recurrent network is employed for human motion prediction, as a black box, to handle both of them in the output motion prediction. Departing from this general trend, we first learn a continuous pose embedding space independent of the motion dynamics, to avoid prediction of unlikely or improbable skeleton joint parameters. This is crucial, especially for models targeting long-term motion prediction, as short-term motion of less than 200ms exhibits minimal diversity in the forecasted pose with respect to the immediate past frames.

Learning of Pose Embedding Representation

The objective is to learn a pose embedding space, $z_{pose}$, so that $P(z_{pose})$ models the distribution of only plausible joint angle arrangements. The first choice is to use a generative adversarial network to model the same, which includes a pose generator (or decoder), $DE_{pose}$, as a transformation from $z_{pose} \sim P(z_{pose})$ to the actual skeletal pose, $x_{pose} \sim P(x_{pose})$. We emphasize learning of a generative model instead of a simple auto-encoder, as the objective is to learn a continuous pose embedding space which allows effective interpolation of a pose sequence between two plausible pose frames (Radford, Metz, and Chintala 2015). A simple autoencoder that is not explicitly enforced to be generative learns a discrete pose embedding space modeling only the available training samples, and hence delivers sub-optimal interpolation results. The core idea is to interpret the pose sequence, in the later stage of the motion prediction framework, as a trajectory in the pose embedding space. Such a setting not only enforces prediction of plausible pose frames, but also reduces the burden on the subsequent sequence learning framework by segregating the complex task of efficient pose sequence prediction.

Following the idea of modeling human motion as a trajectory in the pose embedding space, the pose sequence decoder network must output a $z_{pose}$ sequence instead of an $x_{pose}$ sequence directly, as attempted by previous approaches (Li et al. 2018). Similarly, the pose sequence encoder architecture also takes a $z_{pose}$ sequence as its input representation. This calls for an inference function to transform $x_{pose}$ to the corresponding $z_{pose}$, which is realized by introducing a pose encoder, $EN_{pose}$. Motivated by the adversarial auto-encoder framework (Makhzani et al. 2015), we train the full adversarial pose autoencoder by employing a pose discriminator, which can distinguish between predicted and actual skeletal joint angle patterns. A cyclic reconstruction loss is added on both $x_{pose}$ and $z_{pose}$ to enforce learning of a one-to-one mapping in a generative adversarial setup.

$$L_{cyc} = |x_{pose} - \hat{x}_{pose}| + |z_{pose} - \hat{z}_{pose}|$$

where $\hat{x}_{pose} = DE_{pose}(EN_{pose}(x_{pose}))$ and $\hat{z}_{pose} = EN_{pose}(DE_{pose}(z_{pose}))$.

Here, $z_{pose}$ is sampled from a predefined prior distribution $P(z_{pose})$. Note that $EN_{pose}$ is trained using only the $L_{cyc}$ loss, whereas $DE_{pose}$ is trained using $L_{cyc} + \lambda L_{adv}$. Furthermore, the effectiveness of the model is evaluated by visualizing interpolation results between two randomly chosen pose frames. A balance between the cyclic reconstruction loss, $L_{cyc}$, and the adversarial discriminator loss, $L_{adv}$, is maintained by exploring an effective relative weighting scheme. This is crucial, as too much emphasis on the cyclic reconstruction loss may derail the setup towards learning a discrete embedding space with deteriorated generalization to novel pose samples.
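The cyclic reconstruction objective can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_encode`, `toy_decode`, and `biased_decode` are hypothetical stand-ins for $EN_{pose}$ and $DE_{pose}$ (simple element-wise maps rather than trained networks, and with equal input and output dimensions for brevity), and the norm is assumed to be a sum of absolute differences.

```python
from typing import Callable, List

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def l1_distance(a: Vector, b: Vector) -> float:
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cyclic_loss(x_pose: Vector, z_pose: Vector,
                encode: Mapping, decode: Mapping) -> float:
    """L_cyc = |x - DE(EN(x))| + |z - EN(DE(z))|."""
    x_hat = decode(encode(x_pose))   # pose -> embedding -> pose
    z_hat = encode(decode(z_pose))   # embedding -> pose -> embedding
    return l1_distance(x_pose, x_hat) + l1_distance(z_pose, z_hat)


# Hypothetical toy maps (not trained networks): exact inverses of each other.
def toy_encode(x: Vector) -> Vector:
    return [0.5 * v for v in x]


def toy_decode(z: Vector) -> Vector:
    return [2.0 * v for v in z]


# A deliberately imperfect decoder, to show a non-zero loss.
def biased_decode(z: Vector) -> Vector:
    return [2.0 * v + 0.1 for v in z]


perfect = cyclic_loss([1.0, -2.0], [0.25, 0.5], toy_encode, toy_decode)
imperfect = cyclic_loss([1.0, -2.0], [0.25, 0.5], toy_encode, biased_decode)
print(perfect, imperfect)
```

When the two maps invert each other the loss vanishes; any mismatch in either direction of the cycle contributes to it, which is what pushes the pair towards a one-to-one mapping.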

Probabilistic Motion Prediction Framework

After obtaining an effective pose descriptor from the learned pose embedding space, we focus on modeling the temporal aspect of pose dynamics. Different human motion categories will form a certain type of trajectory in the learned pose embedding manifold. Note that the trajectory should consist of smooth transitions of $z_{pose}$ as a result of the probabilistic generative approach used to train the embedding representation. The resultant transformation functions, viz. $EN_{pose}$ and $DE_{pose}$, with frozen learned parameters, are utilized in the later stage to effectively model human motion as a trajectory in the learned pose embedding. This is realized by introducing a recurrent sequence encoder $RNN_{enc}$ and a decoder $RNN_{dec}$, as shown in Figure 2.

Figure 2: Illustration of the full Bi HMP-GAN pipeline: (a) pose embedding ($EN_{pose}$, $DE_{pose}$), with motion sequences as trajectories in the pose manifold; (b) RNN encoder-decoder; (c) discriminator architecture, with forward ($RNN_f$) and backward ($RNN_b$) recurrent cells and output heads $DISC_{WGAN}$ and $DISC_r$. Note that $RNN_{dec}$ is modeled in a residual setup, where each cell $t$ predicts a residual which is added to $\hat{z}_{t-1}$ to obtain the final prediction $\hat{z}_t$.

$RNN_{enc}$ takes a sequence of pose embeddings as input, which can be represented as $Z_{1:T} = [z_1, z_2, \ldots, z_T] = [EN_{pose}(x_1), EN_{pose}(x_2), \ldots, EN_{pose}(x_T)]$. The final hidden state representation at time $T$, i.e. $h^{enc}_T$, is considered as the intrinsic factor required for the prediction of future pose dynamics. To model the inherent stochasticity in the generation of the future pose sequence, we introduce an extrinsic factor $r$. Here $r$ is a random vector drawn from a probability distribution $P(r)$, which can be taken as either a Gaussian or a Uniform prior distribution. To influence the prediction of the future pose sequence, the decoder recurrent network ($RNN_{dec}$) takes a tuple of both the intrinsic and extrinsic factors, i.e. $(h^{enc}_T, r)$, as shown in Figure 2.

Previous approaches design the decoder as an autoregressive framework, which mostly considers the short-term past sequence to regress the next pose representation. An optimum setup would be one where the next frame is directly influenced by both long-term and short-term past representations. Here, the long-term information is related to the global properties of the given past motion dynamics, which include the motion category and other pose and environmental constraints, whereas the short-term representation constitutes pose dynamics from the immediate past pose, enforcing smoothness in the predicted sequence. Motivated by this, we feed a concatenated representation of $h^{enc}_T$, $r$, and the chained input from the predicted past pose as input to $RNN_{dec}$ at each time-step. Let the predicted sequence output from $RNN_{dec}$ be $\hat{Z}_{T+1:T'} = [\hat{z}_{T+1}, \hat{z}_{T+2}, \ldots, \hat{z}_{T'}]$. (Here $RNN_{dec}$ is modeled in a residual setup: the output of each cell $t$ is added to $\hat{z}_{t-1}$ to obtain the final prediction $\hat{z}_t$.) Then the input at the $t$-th time-step to $RNN_{dec}$ is the concatenated tuple $(h^{enc}_T, r, \hat{z}_{t-1})$, as shown in Figure 2. The initial hidden state of $RNN_{dec}$ is also a function of both $h^{enc}_T$ and $r$. As the sequence decoder predicts embeddings of the actual pose representation, the final human pose prediction is obtained by utilizing the frozen $DE_{pose}$ transformation. Therefore the final output is $\hat{X}_{T+1:T'} = [DE_{pose}(\hat{z}_{T+1}), DE_{pose}(\hat{z}_{T+2}), \ldots, DE_{pose}(\hat{z}_{T'})]$.
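The chained, residual decoding described here can be sketched in a few lines. The names `decode_sequence`, `make_toy_cell`, and `concat_init` are hypothetical; the toy cell emits a constant residual and stands in for the LSTM cell, so the sketch shows only the data flow (what is concatenated at each step and how the residual is accumulated), not a trained model.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Cell = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def decode_sequence(h_enc: Vector, r: Vector, z_last: Vector, steps: int,
                    cell: Cell,
                    init_hidden: Callable[[Vector, Vector], Vector]) -> List[Vector]:
    """Roll the decoder forward; every step sees (h_enc, r, previous z)."""
    hidden = init_hidden(h_enc, r)       # initial state depends on both factors
    z_prev = list(z_last)
    outputs: List[Vector] = []
    for _ in range(steps):
        step_input = list(h_enc) + list(r) + z_prev   # concatenated tuple
        residual, hidden = cell(step_input, hidden)
        z_prev = [z + dz for z, dz in zip(z_prev, residual)]  # residual update
        outputs.append(z_prev)
    return outputs


def make_toy_cell(z_dim: int, step: float) -> Cell:
    """Hypothetical stand-in for a recurrent cell: constant residual, state passed through."""
    def cell(step_input: Vector, hidden: Vector) -> Tuple[Vector, Vector]:
        return [step] * z_dim, hidden
    return cell


def concat_init(h_enc: Vector, r: Vector) -> Vector:
    """Toy initial state: plain concatenation of intrinsic and extrinsic factors."""
    return list(h_enc) + list(r)


trajectory = decode_sequence([0.0, 0.0], [0.5], [1.0, 2.0], 3,
                             make_toy_cell(2, 1.0), concat_init)
print(trajectory)
```

Because each output is the previous embedding plus a residual, a cell that emits small residuals yields a smooth trajectory in the embedding space; the decoded poses would then be obtained by applying the frozen pose decoder to each element.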

Discriminator Design Supporting Enforcement of Content Loss

The sequence prediction framework is also designed by taking motivation from generative adversarial networks. The objective is to enable modeling of variations in the prediction of the future sequence conditioned on the given past motion, i.e. $P(X_{T+1:T'} \mid X_{1:T})$. As described above, $RNN_{dec}$ effectively takes two input representations, viz. the output of $RNN_{enc}$ and $r$. Following this, the discriminator takes the predicted pose sequence along with the input conditioning past frames, as shown in Figure 2. Here, the discriminator architecture has two output heads, viz. (a) $DISC_{WGAN}$ and (b) $DISC_r$. The discriminator not only outputs a single neuron for the usual adversarial loss, but also predicts the random vector $r$ which was used to generate the corresponding predicted sequence. Moreover, a separate critic network with a similar architecture and a single output head, named $DISC_{critic}$, is introduced. A binary cross entropy loss is applied on the output of $DISC_{critic}$ after the final sigmoid nonlinearity to learn a discriminative function that distinguishes between the predicted and actual pose sequences. The single-neuron output of $DISC_{WGAN}$ is used to enforce minimization of the Earth Mover Distance (EMD), as proposed by Arjovsky, Chintala, and Bottou (2017). Note that, for learning stability, only the adversarial loss from $DISC_{WGAN}$ is used to train the RNN encoder-decoder parameters, following the implementation tricks of Gulrajani et al. (2017).

The additional output head $DISC_r$ attached to the discriminator is a novel approach to regress the $r$ vector which can generate the input future sequence given the past motion dynamics. The motivation behind the incorporation of $DISC_r$ is twofold. First, being able to regress $r$ while training the encoder-decoder parameters enforces learning of a one-to-one mapping, avoiding mode-collapse. Secondly, it offers a new direction to enforce content information directly on the predicted motion sequence. Suppose there exists a particular $r$ which can generate the future frames exactly as given in a chosen training sample of length $T'$. To enforce a content loss directly on the predicted sequence of $RNN_{dec}$, we first pass the full sequence (of length $T'$) through the trained discriminator to obtain a specific $r$ vector from the output head $DISC_r$. This $r$ is then utilized in the next iteration to enforce a direct content loss between the predicted final pose sequence, $\hat{X}_{T+1:T'}$, and the ground truth, $X_{T+1:T'}$, as described in Figure 3.

Figure 3: Workflow illustrating enforcement of content loss in Bi HMP-GAN as indicated by purple arrows.

/* Initialization of parameters */
$\Theta_{ENC}$: parameters of $RNN_{enc}$
$\Theta_{DEC}$: parameters of $RNN_{dec}$
$\Theta_{DISC}$: parameters of $RNN_{disc}$
for k iterations do
    for m steps do
        $X_{1:T+\alpha T}$: minibatch of training motion sequences
        $r$: random minibatch sampled from the prior $p(r)$
        $\hat{X}^r_{T+1:T'} = RNN_{dec}([RNN_{enc}(X_{1:T}), r])$
        $L^{disc}_{adv} = DISC_{WGAN}([X_{1:T}, \hat{X}^r_{T+1:T'}]) - DISC_{WGAN}([X_{1:T}, X_{T+1:T'}])$
        $L^r_{rec} = |r - DISC_r([X_{1:T}, \hat{X}^r_{T+1:T'}])|$
        /* Update parameters of the DISC network */
        $\Theta_{DISC} := \arg\min_{\Theta_{DISC}} (L^{disc}_{adv} + \lambda_r L^r_{rec})$
    end
    $r = DISC_r([X_{1:T}, X_{T+1:T'}])$
    $\hat{X}^r_{T+1:T'} = RNN_{dec}([RNN_{enc}(X_{1:T}), r])$
    $L^X_{content} = |X_{T+1:T'} - \hat{X}^r_{T+1:T'}|$
    $L^{gen}_{adv} = -DISC_{WGAN}([X_{1:T}, \hat{X}^r_{T+1:T'}])$
    /* Update parameters of RNN_enc and RNN_dec */
    $\Theta_{DEC} := \arg\min_{\Theta_{DEC}} (L^{gen}_{adv} + \lambda_r L^r_{rec} + \lambda_c L^X_{content})$
    $\Theta_{ENC} := \arg\min_{\Theta_{ENC}} (L^{gen}_{adv} + \lambda_c L^X_{content})$
end

Algorithm 1: Training algorithm for Bi HMP-GAN, with explicit enforcement of direct content loss. Here $[\cdot, \cdot]$ denotes concatenation.
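The loss terms of Algorithm 1 can be written out as plain functions to make the data flow explicit. The sketch below is illustrative only: sequences are flattened to lists of floats, `toy_wgan`, `toy_disc_r`, and `toy_generate` are hypothetical stand-ins for the discriminator heads and the encoder-decoder, and the negative sign on the generator's adversarial term is an assumption based on the usual WGAN formulation.

```python
from typing import Callable, List, Tuple

Seq = List[float]


def l1_distance(a: Seq, b: Seq) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def discriminator_losses(disc_wgan: Callable[[Seq], float],
                         disc_r: Callable[[Seq], Seq],
                         past: Seq, real_future: Seq,
                         fake_future: Seq, r: Seq) -> Tuple[float, float]:
    """Inner loop: critic term, and regression of r from the generated sequence."""
    adv = disc_wgan(past + fake_future) - disc_wgan(past + real_future)
    rec = l1_distance(r, disc_r(past + fake_future))
    return adv, rec


def generator_losses(disc_wgan: Callable[[Seq], float],
                     disc_r: Callable[[Seq], Seq],
                     generate: Callable[[Seq, Seq], Seq],
                     past: Seq, real_future: Seq) -> Tuple[float, float]:
    """Outer loop: r is inferred from the real sequence and reused for the content loss."""
    r_inferred = disc_r(past + real_future)
    fake_future = generate(past, r_inferred)
    content = l1_distance(real_future, fake_future)
    adv = -disc_wgan(past + fake_future)   # assumed sign (standard WGAN generator term)
    return adv, content


# Hypothetical toy stand-ins, chosen so the numbers are easy to follow.
def toy_wgan(seq: Seq) -> float:
    return sum(seq)                 # scalar critic score


def toy_disc_r(seq: Seq) -> Seq:
    return seq[-2:]                 # "regresses" r from the last two values


def toy_generate(past: Seq, r: Seq) -> Seq:
    return list(r)                  # future frames are exactly r


past, real_future = [1.0], [2.0, 3.0]
r = [0.5, 0.5]
fake_future = toy_generate(past, r)
disc_adv, r_rec = discriminator_losses(toy_wgan, toy_disc_r, past,
                                       real_future, fake_future, r)
gen_adv, content = generator_losses(toy_wgan, toy_disc_r, toy_generate,
                                    past, real_future)
print(disc_adv, r_rec, gen_adv, content)
```

In this toy setting the generator is an exact inverse of the regression head, so both the reconstruction of $r$ and the content loss are zero; in training, driving these two terms down is what ties each real future sequence to a specific extrinsic vector.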

Regularization by Recursive Prediction

To further regularize the training procedure, we incorporate recursive prediction of the motion sequence. Consider an input motion sequence $X_{1:\alpha T'}$ of length $\alpha T'$, where $\alpha$ is some integer value depending on the available sequence length of a training sample. First, the prediction framework is used to obtain $\hat{X}_{T+1:T'}$ by considering $X_{1:T}$ as the past motion sequence. Following this, $\hat{X}_{T'+1:2T'}$ is obtained for $\alpha = 2$ by conditioning on the predicted past sequence, i.e. $\hat{X}_{T'-T+1:T'}$, as input dynamics. In general, for a particular $\alpha$ value, $\hat{X}_{(\alpha-1)T'+1:\alpha T'}$ is obtained by considering the intrinsic input factor as a function of $\hat{X}_{(\alpha-1)T'-T+1:(\alpha-1)T'}$. But as discussed above, a specific extrinsic factor $r_\alpha$ is required for each $\alpha$ value to be able to enforce a direct content loss in the probabilistic framework. $r_\alpha$ is obtained from the discriminator head $DISC_r$ for the following concatenated input sequence: $[X_{(\alpha-1)T'-T:(\alpha-1)T'}, X_{(\alpha-1)T'+1:\alpha T'}]$, for each recursive $\alpha$ step. This regularization not only improves our long-term prediction results, but also acts as an effective safeguard against convergence to the mean pose, unlike previous state-of-the-art methods.
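The recursive scheme amounts to sliding the conditioning window forward over the model's own output, with one extrinsic factor per recursive step. A minimal sketch follows; `recursive_predict` and `toy_predict` are hypothetical names, frames are scalars for readability, and the toy predictor simply extrapolates linearly with a slope offset by the extrinsic factor.

```python
from typing import Callable, List

Frames = List[float]


def recursive_predict(past: Frames, extrinsics: List[float],
                      predict: Callable[[Frames, float], Frames]) -> List[Frames]:
    """Predict one chunk per extrinsic factor, re-conditioning on the latest window."""
    window_len = len(past)
    history = list(past)
    chunks: List[Frames] = []
    for r in extrinsics:                    # one extrinsic factor per recursive step
        window = history[-window_len:]      # most recent frames, real or predicted
        future = predict(window, r)
        chunks.append(future)
        history.extend(future)
    return chunks


def toy_predict(window: Frames, r: float, horizon: int = 2) -> Frames:
    """Hypothetical predictor: linear extrapolation whose slope is shifted by r."""
    step = (window[-1] - window[-2]) + r
    return [window[-1] + step * (k + 1) for k in range(horizon)]


same_speed = recursive_predict([0.0, 1.0, 2.0, 3.0], [0.0, 0.0], toy_predict)
speed_up = recursive_predict([0.0, 1.0, 2.0, 3.0], [0.0, 1.0], toy_predict)
print(same_speed, speed_up)
```

The second call shows how a different extrinsic factor at the second step changes only the later chunk, while the first chunk, conditioned on the same real frames with the same factor, is unchanged.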

Discriminator architecture

HP-GAN proposes to utilize the full motion length $[X_{1:T}, X_{T+1:T'}]$ as input to the recurrent pose discriminator architecture ($[\cdot, \cdot]$ represents the concatenation operation). The goal is to match the distribution $P(X_{T+1:T'} \mid X_{1:T})$ with $P(\hat{X}^r_{T+1:T'} \mid X_{1:T})$ for $r \sim p(r)$ by following the generative adversarial learning technique. Unlike HP-GAN, to effectively capture $P(\hat{X}^r_{T+1:T'} \mid X_{1:T})$ we propose certain intuitive modifications to the discriminator architecture. The qualitative results of HP-GAN (Barsoum, Kender, and Liu 2017) clearly highlight the catastrophic drift in the initial pose predictions as compared to the immediate past. In general, by effectively modeling $P(X_{T+\tau} \mid X_{1:T})$ for a very small $\tau$ (i.e. less than 50ms), such spurious drifts can be avoided in the predicted motion sequence. This way, we force the model to learn less diversity in the prediction of the initial $\tau$ frames for any extrinsic factor $r \sim p(r)$, hence avoiding the catastrophic drift in the generations. Following this, we model both $P(X_{T+\tau} \mid X_{1:T})$ and $P(X_{T-1-\tau} \mid X_{T+1:T'})$ by employing a bidirectional recurrent neural network, as shown in Figure 2. We utilize the idea of a plausible trajectory in the learned pose embedding by feeding the sequence of pose embedding representations (i.e. the output of $EN_{pose}$) to the bidirectional recurrent architecture. The final output of the discriminator is extracted from four different hidden representations, i.e. the final hidden states of both the forward and backward recurrent networks along with $h_{forward}(T+\tau)$ and $h_{backward}(T-1-\tau)$, as shown in Figure 2.
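The selection of the four hidden representations can be sketched with a toy recurrence. Everything here is an illustrative assumption: states are scalars, `run_rnn` and `discriminator_features` are hypothetical names, the recurrence is a running sum rather than an LSTM, time-steps are taken to be 1-based as in the text, and the backward index is read as $T - 1 - \tau$.

```python
from typing import Callable, List

Step = Callable[[float, float], float]


def run_rnn(seq: List[float], step: Step, h0: float = 0.0) -> List[float]:
    """Return the hidden state after each element of seq."""
    states: List[float] = []
    h = h0
    for x in seq:
        h = step(h, x)
        states.append(h)
    return states


def discriminator_features(z_seq: List[float], T: int, tau: int,
                           step: Step) -> List[float]:
    """Collect the four hidden states fed to the discriminator heads."""
    fwd = run_rnn(z_seq, step)               # fwd[i] summarizes z_seq[:i + 1]
    bwd = run_rnn(z_seq[::-1], step)[::-1]   # bwd[i] summarizes z_seq[i:]
    return [
        fwd[-1],                 # final forward state
        bwd[0],                  # final backward state
        fwd[T + tau - 1],        # forward state at time-step T + tau (1-based)
        bwd[T - 1 - tau - 1],    # backward state at time-step T - 1 - tau (1-based)
    ]


def running_sum(h: float, x: float) -> float:
    return h + x


features = discriminator_features([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                                  T=4, tau=1, step=running_sum)
print(features)
```

The two intermediate states sit just across the boundary between past and predicted frames, so a discriminator that uses them sees whether the first few generated frames continue the seed sequence smoothly.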

Experiments

In this section we describe the experimental details of Bi HMP-GAN along with an analysis of both qualitative and quantitative results on two publicly available datasets, viz. (a) Human 3.6M (Ionescu et al. 2014) and (b) CMU MOCAP. The full pipeline of Bi HMP-GAN is implemented in TensorFlow with the Adam optimizer. We use a batch size of 32 with the learning rate set at 0.00005. A single-layer LSTM (Chung et al. 2014) with 512 hidden units is used as the recurrent architecture for the sequence encoder, the decoder, and the bidirectional discriminator network. Following previous motion prediction works (Li et al. 2018; Martinez, Black, and Romero 2017), the length of the intrinsic past pose sequence is set to 50, i.e. 2 seconds of skeleton motion at 25 fps. For a fair evaluation of long-term prediction, the length of the predicted motion sequence is set to 25. We choose $\tau = 1$ for the modified discriminator architecture. The value of $\alpha$ for the recursive prediction regularization is set to 2. Instead of training the recurrent encoder-decoder parameters with the sum of all the loss functions described above, we sequentially iterate over $L^X_{content}$ and the recursive content regularization loss separately from the adversarial loss, $L^{disc}_{adv} + \lambda_r L^r_{rec}$, by defining a different Adam optimizer for each of them. We choose a $N(0, 1)$ prior distribution for both $z_{pose}$ and $r$, with 32 and 8 dimensions respectively. To ensure a fair comparison, we trained HP-GAN (Barsoum, Kender, and Liu 2017) on the Human 3.6M dataset with the same setting of sequence lengths and input representations, using the publicly available implementation.

Table 1: Comparison of motion prediction error on the Human 3.6M dataset for short-term (80ms, 160ms, 320ms, 400ms) and long-term (1000ms) prediction. Bi HMP-GAN clearly outperforms others in long-term prediction.

| Action | Method | 80ms | 160ms | 320ms | 400ms | 1000ms |
|---|---|---|---|---|---|---|
| Walking | RRNN | 0.33 | 0.56 | 0.78 | 0.85 | 1.14 |
| Walking | Conv-Motion | 0.33 | 0.54 | 0.68 | 0.73 | 0.92 |
| Walking | HP-GAN(minerr) | 0.95 | 1.17 | 1.69 | 1.79 | 2.47 |
| Walking | Ours(minerr) | 0.33 | 0.52 | 0.64 | 0.69 | 0.88 |
| Walking | Ours(r) | 0.33 | 0.52 | 0.63 | 0.67 | 0.85 |
| Eating | RRNN | 0.26 | 0.43 | 0.66 | 0.81 | 1.34 |
| Eating | Conv-Motion | 0.22 | 0.36 | 0.58 | 0.71 | 1.24 |
| Eating | HP-GAN(minerr) | 1.28 | 1.47 | 1.70 | 1.82 | 2.51 |
| Eating | Ours(minerr) | 0.21 | 0.33 | 0.55 | 0.71 | 1.20 |
| Eating | Ours(r) | 0.20 | 0.33 | 0.54 | 0.70 | 1.20 |
| Smoking | RRNN | 0.35 | 0.64 | 1.03 | 1.15 | 1.83 |
| Smoking | Conv-Motion | 0.26 | 0.49 | 0.96 | 0.92 | 1.62 |
| Smoking | HP-GAN(minerr) | 1.71 | 1.89 | 2.33 | 2.42 | 3.2 |
| Smoking | Ours(minerr) | 0.26 | 0.49 | 0.91 | 0.88 | 1.12 |
| Smoking | Ours(r) | 0.26 | 0.50 | 0.91 | 0.86 | 1.11 |
| Discussion | RRNN | 0.37 | 0.77 | 1.06 | 1.10 | 1.79 |
| Discussion | Conv-Motion | 0.32 | 0.67 | 0.94 | 1.01 | 1.86 |
| Discussion | HP-GAN(minerr) | 2.29 | 2.61 | 2.79 | 2.88 | 3.67 |
| Discussion | Ours(minerr) | 0.32 | 0.65 | 0.92 | 9.98 | 1.78 |
| Discussion | Ours(r) | 0.33 | 0.65 | 0.91 | 9.95 | 1.77 |

Table 2: Ablation analysis on Human 3.6M, reporting mean average error (across 15 categories) at 1000ms.

| Metrics | $r = DISC_r$ | $r = \arg\min err$ |
|---|---|---|
| Without pose embedding | 1.76 | 1.76 |
| Without encoder state in chaining | 1.71 | 1.72 |
| Without recursive prediction | 1.69 | 1.69 |
| Bi HMP-GAN | 1.67 | 1.68 |

Table 3: Quantitative comparison with HP-GAN (classifier accuracy on real test samples of Human 3.6M: 55.4%). We use the proposed discriminator architecture to design the critic network for HP-GAN, which can easily detect the catastrophic drift in the initial predicted sequence.

| Accuracy (%) | Motion Classifier | Critic |
|---|---|---|
| HP-GAN | 9.8 | 18.5 |
| Bi HMP-GAN | 41.2 | 74.6 |
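For reference, the training settings reported in this section can be gathered into a single configuration object. The class and field names below are hypothetical (the paper does not name them); only the values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    batch_size: int = 32
    learning_rate: float = 5e-5
    lstm_hidden_units: int = 512
    past_frames: int = 50       # length of the conditioning sequence
    future_frames: int = 25     # length of the predicted sequence
    fps: int = 25
    tau: int = 1                # offset used by the modified discriminator
    alpha: int = 2              # number of recursive prediction steps
    z_pose_dim: int = 32        # pose embedding, N(0, 1) prior
    r_dim: int = 8              # extrinsic factor, N(0, 1) prior

    @property
    def past_seconds(self) -> float:
        return self.past_frames / self.fps

    @property
    def future_ms(self) -> float:
        return 1000 * self.future_frames / self.fps


config = TrainConfig()
print(config.past_seconds, config.future_ms)
```

The two derived properties make the relation between frame counts and durations explicit: 50 frames at 25 fps is the 2 second seed, and 25 predicted frames correspond to the 1000ms long-term horizon.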

Datasets

Human 3.6M is a widely accepted dataset for benchmarking human motion prediction works, as it comprises highly diverse action categories with actions performed by multiple subjects. The preprocessing and data selection criteria are directly followed from the recent work of Li et al. (2018). We finally use a 54-dimensional input representation as $x_{pose}$, eliminating global orientation and translation parameters. Euclidean error on the predicted Euler angles is used as the evaluation metric for comparison of Bi HMP-GAN against previous state-of-the-art motion prediction methods. We also report the performance of Bi HMP-GAN on the CMU motion capture dataset to demonstrate generalization of the proposed probabilistic prediction method. We follow the preprocessing and data selection criteria from Li et al. (2018), which finally selects eight action categories after pruning interaction-based and other repeated action categories.
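As a rough illustration of the evaluation metric, the sketch below computes the Euclidean error between predicted and ground-truth angle vectors at the reported horizons. It assumes a 25 fps frame rate (so 80ms corresponds to the second predicted frame) and evaluates a single sequence; the function names are hypothetical, and any averaging over test sequences is left out.

```python
import math
from typing import Dict, List, Sequence

Frame = List[float]


def horizon_to_frame(ms: int, fps: int = 25) -> int:
    """1-based index of the predicted frame that lies `ms` after the seed."""
    return round(ms * fps / 1000)


def euler_error(pred: Frame, truth: Frame) -> float:
    """Euclidean distance between two joint-angle vectors."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)))


def errors_at_horizons(pred_seq: List[Frame], true_seq: List[Frame],
                       horizons_ms: Sequence[int] = (80, 160, 320, 400, 1000),
                       fps: int = 25) -> Dict[int, float]:
    """Per-horizon error for one predicted sequence."""
    result: Dict[int, float] = {}
    for ms in horizons_ms:
        idx = horizon_to_frame(ms, fps) - 1   # convert to 0-based list index
        result[ms] = euler_error(pred_seq[idx], true_seq[idx])
    return result


# Toy data: 25 predicted frames, each off by (3, 4) from an all-zero ground truth.
pred_seq = [[3.0, 4.0] for _ in range(25)]
true_seq = [[0.0, 0.0] for _ in range(25)]
errors = errors_at_horizons(pred_seq, true_seq)
print(errors)
```

With 25 predicted frames at 25 fps, the 1000ms horizon is the last predicted frame, which is why the long-term column of the result tables corresponds to the end of the predicted sequence.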

Comparison with other generative approaches

We first compare our prediction performance against the available generative model, HP-GAN (Barsoum, Kender, and Liu 2017). After training HP-GAN with the same settings on the Human 3.6M dataset, the quality of the predicted motion is evaluated by measuring how well a critic network can distinguish between generated and real motion dynamics. Note that we have employed the proposed modified discriminator architecture for the critic network, to specifically take into account the initial drift in the predicted motion (see Table 3). We also report the performance of the generated motion by feeding the concatenated seed sequence and generated motion to an action classifier trained only on real human motion dynamics (see Table 3). Both qualitative (see Figure 4) and quantitative (see Table 3) results clearly demonstrate the superiority of Bi HMP-GAN. As a generative model, unlike HP-GAN, Bi HMP-GAN is able to predict diverse sequences without losing coherence with the immediate past conditioning.

Comparison with other deterministic approaches

For each test sample $X_{1:T'}$ of length $T'$, there exists a particular $r$ which can model the exact predicted motion as $\hat{X}^r_{T+1:T'} = RNN_{dec}([RNN_{enc}(X_{1:T}), r])$. Therefore, the modeling expressiveness of a generative method can be evaluated by obtaining the best possible value of $r$ which can express a given test sample. Motivated by this, we define two different metrics to quantitatively assess the quality of non-deterministic predictions. Firstly, considering $r = DISC_r([X_{1:T}, X_{T+1:T'}])$, we report the prediction error of $\hat{X}^r_{T+1:T'}$ against the corresponding ground truth $X_{T+1:T'}$, which is denoted as Ours(r) in Tables 1 and 4. The metrics clearly demonstrate the quality of the generated motion for both short-term (80 ms, 160 ms, 320 ms and 400 ms) and long-term (1000 ms) prediction. Improved results on long-term prediction performance

Table 4: Comparison of motion prediction error on the CMU MOCAP dataset for short-term (80 ms, 160 ms, 320 ms, 400 ms) and long-term (1000 ms) prediction. Bi HMP-GAN attains the lowest long-term error in seven of the eight categories; RRNN is lower on Basketball.

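| Action | Method | 80 ms | 160 ms | 320 ms | 400 ms | 1000 ms |
|---|---|---|---|---|---|---|
| Basketball | RRNN | 0.50 | 0.80 | 1.27 | 1.45 | 1.78 |
| Basketball | Conv-Motion | 0.37 | 0.62 | 1.07 | 1.18 | 1.95 |
| Basketball | Ours(minerr) | 0.36 | 0.60 | 1.02 | 1.12 | 1.84 |
| Basketball | Ours(r) | 0.37 | 0.62 | 1.01 | 1.11 | 1.83 |
| Basketball Signal | RRNN | 0.41 | 0.76 | 1.32 | 1.54 | 2.15 |
| Basketball Signal | Conv-Motion | 0.32 | 0.59 | 1.04 | 1.24 | 1.96 |
| Basketball Signal | Ours(minerr) | 0.33 | 0.56 | 1.00 | 1.19 | 1.89 |
| Basketball Signal | Ours(r) | 0.32 | 0.56 | 1.01 | 1.18 | 1.88 |
| Directing Traffic | RRNN | 0.33 | 0.59 | 0.93 | 1.10 | 2.05 |
| Directing Traffic | Conv-Motion | 0.25 | 0.56 | 0.89 | 1.00 | 2.04 |
| Directing Traffic | Ours(minerr) | 0.25 | 0.52 | 0.84 | 0.96 | 1.97 |
| Directing Traffic | Ours(r) | 0.25 | 0.51 | 0.85 | 0.96 | 1.95 |
| Jumping | RRNN | 0.56 | 0.88 | 1.77 | 2.02 | 2.40 |
| Jumping | Conv-Motion | 0.39 | 0.60 | 1.36 | 1.56 | 2.01 |
| Jumping | Ours(minerr) | 0.38 | 0.57 | 1.32 | 1.51 | 1.94 |
| Jumping | Ours(r) | 0.39 | 0.57 | 1.31 | 1.50 | 1.93 |
| Running | RRNN | 0.33 | 0.50 | 0.66 | 0.75 | 1.00 |
| Running | Conv-Motion | 0.28 | 0.41 | 0.52 | 0.57 | 0.67 |
| Running | Ours(minerr) | 0.27 | 0.40 | 0.49 | 0.54 | 0.65 |
| Running | Ours(r) | 0.28 | 0.40 | 0.50 | 0.53 | 0.62 |
| Soccer | RRNN | 0.29 | 0.51 | 0.88 | 0.99 | 1.72 |
| Soccer | Conv-Motion | 0.26 | 0.44 | 0.75 | 0.87 | 1.56 |
| Soccer | Ours(minerr) | 0.26 | 0.43 | 0.71 | 0.84 | 1.52 |
| Soccer | Ours(r) | 0.26 | 0.44 | 0.72 | 0.82 | 1.51 |
| Walking | RRNN | 0.35 | 0.47 | 0.60 | 0.65 | 0.88 |
| Walking | Conv-Motion | 0.35 | 0.44 | 0.45 | 0.50 | 0.78 |
| Walking | Ours(minerr) | 0.34 | 0.44 | 0.43 | 0.47 | 0.71 |
| Walking | Ours(r) | 0.35 | 0.45 | 0.44 | 0.46 | 0.72 |
| Washwindow | RRNN | 0.30 | 0.46 | 0.72 | 0.91 | 1.36 |
| Washwindow | Conv-Motion | 0.30 | 0.47 | 0.80 | 1.01 | 1.39 |
| Washwindow | Ours(minerr) | 0.30 | 0.48 | 0.76 | 0.98 | 1.32 |
| Washwindow | Ours(r) | 0.31 | 0.46 | 0.77 | 0.92 | 1.31 |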
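As a quick check on the long-term comparison, the short sketch below takes the 1000 ms errors from Table 4 and computes, for each method, the mean over the eight categories, together with the best method per category. The values are transcribed from the table; the unweighted mean over categories is our own summary and is not a statistic reported in the paper.

```python
from typing import Dict, List

ACTIONS: List[str] = ["Basketball", "Basketball Signal", "Directing Traffic",
                      "Jumping", "Running", "Soccer", "Walking", "Washwindow"]

# 1000 ms errors transcribed from Table 4, in the order of ACTIONS.
LONG_TERM: Dict[str, List[float]] = {
    "RRNN":         [1.78, 2.15, 2.05, 2.40, 1.00, 1.72, 0.88, 1.36],
    "Conv-Motion":  [1.95, 1.96, 2.04, 2.01, 0.67, 1.56, 0.78, 1.39],
    "Ours(minerr)": [1.84, 1.89, 1.97, 1.94, 0.65, 1.52, 0.71, 1.32],
    "Ours(r)":      [1.83, 1.88, 1.95, 1.93, 0.62, 1.51, 0.72, 1.31],
}


def mean_error(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean of the per-category errors for each method."""
    return {m: sum(v) / len(v) for m, v in table.items()}


def best_method_per_action(table: Dict[str, List[float]],
                           actions: List[str]) -> Dict[str, str]:
    """Method with the lowest error for each action category."""
    return {a: min(table, key=lambda m: table[m][i])
            for i, a in enumerate(actions)}


means = mean_error(LONG_TERM)
best = best_method_per_action(LONG_TERM, ACTIONS)
for method, value in means.items():
    print(f"{method:13s} mean 1000 ms error: {value:.3f}")
for action, method in best.items():
    print(f"{action:18s} best: {method}")
```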

[Figure 4 image omitted: panels for HP-GAN and Bi HMP-GAN; time axis in ms with frames at 40, 120, 200, 280, 360, 440, 520, 600, 1840, 1920, and 2000]

Figure 4: Qualitative results on the Human 3.6M dataset for the eating category. The figure illustrates variations in the forecasted motion (green-purple) for a given seed sequence (red-blue) as modeled by HP-GAN and Bi HMP-GAN. The last row shows the motion sequence generated via the minerr strategy. We highlight the catastrophic drift in the predicted motion of HP-GAN with a dotted red box. We observe that HP-GAN generates unrealistic poses for long-term predictions (highlighted by the pink box), as it does not enforce the generation of plausible pose frames. Also, the generations of HP-GAN for a given seed sequence lack variation across different latent vectors r, as opposed to those of Bi HMP-GAN.

show the effectiveness of Bi HMP-GAN in overcoming convergence to the mean pose, which is evident in the deterministic approaches, namely RRNN (Martinez, Black, and Romero 2017) and Conv-Motion (Li et al. 2018). However, the previous metric requires the ground-truth future motion $X_{T+1:T'}$ as an input to the discriminator to obtain the particular vector $r$. Hence, we also propose a second metric to assess the expressibility of the probabilistic motion prediction model, as follows. We first save a batch of 1000 vectors $r_i$ randomly sampled from the prior distribution $P(r)$. Then, for each test sample $X_{1:T'}$, we report the minimum Euclidean error, $\min_i \mathrm{Error}(\hat{X}^{r_i}_{T+1:T'}, X_{T+1:T'})$ for $i = 1, 2, \ldots, 1000$. Tables 1 and 4 report this metric under the row headings Ours(minerr) and HP-GAN(minerr). It highlights the expressiveness of Bi HMP-GAN compared with HP-GAN and the deterministic approaches.
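The two metrics described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_generate` stands in for the encoder-decoder generator, `toy_regress_r` stands in for the r-regression branch of the discriminator, and the prior is assumed to be a standard normal distribution, which the text does not specify. All function names are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]
Motion = List[Vector]
Generator = Callable[[Motion, Vector], Motion]   # (seed, r) -> future
Regressor = Callable[[Motion, Motion], Vector]   # (seed, future) -> r


def sequence_error(pred: Motion, truth: Motion) -> float:
    """Mean per-frame Euclidean distance between two sequences."""
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(truth)


def error_with_regressed_r(generate: Generator, regress_r: Regressor,
                           seed: Motion, future: Motion) -> float:
    """First metric: r is regressed from the seed and the true future."""
    r = regress_r(seed, future)
    return sequence_error(generate(seed, r), future)


def min_error_over_prior(generate: Generator, seed: Motion,
                         future: Motion, r_batch: List[Vector]) -> float:
    """Second metric (minerr): best error over a fixed batch of r_i."""
    return min(sequence_error(generate(seed, r), future) for r in r_batch)


def sample_prior(n: int, dim: int, rng: random.Random) -> List[Vector]:
    # Assumption: standard normal prior; the paper only says "predefined".
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


# Toy stand-ins: the future drifts linearly from the last seed frame by r.
def toy_generate(seed: Motion, r: Vector, horizon: int = 3) -> Motion:
    last = seed[-1]
    return [[x + (t + 1) * d for x, d in zip(last, r)]
            for t in range(horizon)]


def toy_regress_r(seed: Motion, future: Motion) -> Vector:
    return [f - s for f, s in zip(future[0], seed[-1])]


seed = [[0.0, 0.0]]
future = toy_generate(seed, [0.5, -0.25])      # synthetic ground truth
r_batch = sample_prior(1000, 2, random.Random(0))
print(error_with_regressed_r(toy_generate, toy_regress_r, seed, future))
print(min_error_over_prior(toy_generate, seed, future, r_batch))
```

The design difference is visible in the signatures: the first metric needs the ground-truth future to obtain r, whereas the second only uses the ground truth to score a fixed batch of sampled candidates.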

Ablation study

Here, we quantitatively analyze the effectiveness of the design choices and learning schemes proposed for Bi HMP-GAN. To demonstrate the advantage of learning a pose embedding representation, we compare Bi HMP-GAN against a baseline without the pose embedding transformations (see Table 2). For the decoder setup, the effect of feeding the previous pose feature concatenated with the intrinsic encoder hidden state is evaluated against a baseline whose input sequence consists only of the chained previous pose features (see Table 2). Finally, the effect of incorporating the recursive prediction regularization in the training of Bi HMP-GAN is demonstrated against a baseline trained without any such regularization.

Conclusion

In this work, we proposed a novel probabilistic generative model for the prediction of uncertain future motion dynamics. As the model is generative, we carefully designed the framework to model the available training sequences with a direct content loss. Modeling human motion as a trajectory in the pose embedding space keeps Bi HMP-GAN from generating unrealistic pose frames, unlike other approaches. We demonstrate the improved expressibility of Bi HMP-GAN, especially for long-term motion prediction, against deterministic motion prediction works. In the future, we plan to extend a similar training framework to complex motion sequences such as dance and martial arts, with the aim of achieving a general motion embedding.

Acknowledgements

This work was supported by a CSIR Fellowship (Jogendra) and a project grant from the Robert Bosch Centre for Cyber Physical Systems, IISc.

References

Arikan, O.; Forsyth, D. A.; and O'Brien, J. F. 2003. Motion synthesis from annotations. In ACM Transactions on Graphics (TOG), volume 22, 402-408. ACM.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
Barsoum, E.; Kender, J.; and Liu, Z. 2017. HP-GAN: Probabilistic 3D human motion prediction via GAN. CoRR.
Butepage, J.; Black, M. J.; Kragic, D.; and Kjellstrom, H. 2017. Deep representation learning for human motion prediction and classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172-2180.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, 4346-4354.
Ghosh, P.; Song, J.; Aksan, E.; and Hilliges, O. 2017. Learning human motion models for long-term predictions. In 3D Vision (3DV), 2017 International Conference on. IEEE.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 5767-5777.
Ionescu, C.; Papava, D.; Olaru, V.; and Sminchisescu, C. 2014. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7):1325-1339.
Jain, A.; Zamir, A. R.; Savarese, S.; and Saxena, A. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5308-5317.
Koppula, H. S., and Saxena, A. 2013. Anticipating human activities for reactive robotic response. In IROS, 2071. Tokyo.
Li, Z.; Zhou, Y.; Xiao, S.; He, C.; Huang, Z.; and Li, H. 2017. Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363.
Li, C.; Zhang, Z.; Lee, W. S.; and Lee, G. H. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5226-5234.
Mainprice, J., and Berenson, D. 2013. Human-robot collaborative manipulation planning using early prediction of human motion. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, 299-306. IEEE.
Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
Martinez, J.; Black, M. J.; and Romero, J. 2017. On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4674-4683. IEEE.
Pavlovic, V.; Rehg, J. M.; and MacCormick, J. 2001. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems, 981-987.
Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112.
Taylor, G.; Hinton, G.; and Roweis, S. 2007. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems 19:1345.
Wang, J. M.; Fleet, D. J.; and Hertzmann, A. 2008. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2):283-298.