# Learning Calibratable Policies using Programmatic Style-Consistency

Eric Zhan (1), Albert Tseng (1), Yisong Yue (1), Adith Swaminathan (2), Matthew Hausknecht (2)

(1) California Institute of Technology, Pasadena, CA. (2) Microsoft Research, Redmond, WA. Correspondence to: Eric Zhan.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We study the problem of controllable generation of long-term sequential behaviors, where the goal is to calibrate to multiple behavior styles simultaneously. In contrast to the well-studied areas of controllable generation of images, text, and speech, there are two questions that pose significant challenges when generating long-term behaviors: how should we specify the factors of variation to control, and how can we ensure that the generated behavior faithfully demonstrates combinatorially many styles? We leverage programmatic labeling functions to specify controllable styles, and derive a formal notion of style-consistency as a learning objective, which can then be solved using conventional policy learning approaches. We evaluate our framework using demonstrations from professional basketball players and agents in the MuJoCo physics environment, and show that existing approaches that do not explicitly enforce style-consistency fail to generate diverse behaviors, whereas our learned policies can be calibrated for up to $4^5$ (1024) distinct style combinations.

1. Introduction

The widespread availability of recorded tracking data is enabling the study of complex behaviors in many domains, including sports (Chen et al., 2016a; Le et al., 2017b; Zhan et al., 2019; Yeh et al., 2019), video games (Kurin et al., 2017; Broll et al., 2019; Hofmann, 2019), laboratory animals (Eyjolfsdottir et al., 2014; 2017; Branson et al., 2009; Johnson et al., 2016), facial expressions (Suwajanakorn et al., 2017; Taylor et al., 2017), commonplace activities such as cooking (Nishimura et al., 2019), and transportation (Bojarski et al., 2016; Luo et al., 2018; Li et al., 2018; Chang et al., 2019). A key aspect of modern behavioral datasets is that the behaviors can exhibit very diverse styles (e.g., from multiple demonstrators). For example, Figure 1a depicts demonstrations from basketball players with variations in speed, desired destinations, and curvature of movement.

The goal of this paper is to study controllable generation of diverse behaviors by learning to imitate raw demonstrations; or, more technically, to develop style-calibrated imitation learning methods. A controllable, or calibratable, policy would enable the generation of behaviors consistent with various styles, such as low movement speed (Figure 1b), or approaching the basket (Figure 1c), or both styles simultaneously (Figure 1d). Style-calibrated imitation learning methods that can yield such policies can be broadly useful to: (a) perform more robust imitation learning from diverse demonstrations (Wang et al., 2017; Broll et al., 2019), (b) enable diverse exploration in reinforcement learning agents (Co-Reyes et al., 2018), or (c) visualize and extrapolate counterfactual behaviors beyond those seen in the dataset (Le et al., 2017a), amongst many other tasks.

Performing style-calibrated imitation is a challenging task. First, what constitutes a style? Second, when can we be certain that a policy is calibrated when imitating a style?
Third, how can we scale policy learning to faithfully generate combinatorially many styles? In related tasks like controllable image generation, common approaches for calibration use adversarial information factorization or mutual information between generated images and user-specified styles (e.g., gender, hair length) (Creswell et al., 2017; Lample et al., 2017; Chen et al., 2016b). However, we find that these indirect approaches fall well short of generating calibratable sequential behaviors. Intuitively, the aforementioned objectives provide only indirect proxies for style-calibration. For example, Figure 2 illustrates that an indirect baseline approach struggles to reliably generate trajectories that reach a certain displacement, even though the dataset contains many examples of such behavior.

Figure 1. Basketball trajectories from policies that are: (a) the expert; (b) calibrated to move at low speeds; (c) calibrated to end near the basket (within the green boundary); and (d) calibrated for both (b) and (c) simultaneously. Diamonds and dots are initial and final positions.

Figure 2. Basketball trajectories sampled from baseline policies and our models calibrated to the style of DISPLACEMENT, with 6 classes corresponding to regions separated by blue lines: (a) baseline, low displacement; (b) ours, low displacement; (c) baseline, high displacement; (d) ours, high displacement. Diamonds and dots indicate initial and final positions, respectively. Each policy is conditioned on a label class for DISPLACEMENT (low in (a,b), high in (c,d)). Green dots indicate trajectories that are consistent with the style label, while red dots indicate those that are not. Our policy (b,d) is better calibrated for this style than the baselines (a,c).

Research questions. We seek to answer three research questions while tackling this challenge. The first is strategic: since high-level stylistic attributes like movement speed are typically not provided with the raw demonstration data, what systematic form of domain knowledge can we leverage to quickly and cleanly extract highly varied style information from raw behavioral data? The second is formulaic: how can we formalize the learning objective to encourage learning style-calibratable policies that can be controlled to realize many diverse styles? The third is algorithmic: how do we design practical learning approaches that reliably optimize the learning objective?

Our contributions. To address these questions, we present a novel framework inspired by data programming (Ratner et al., 2016), a paradigm in weak supervision that utilizes automated labeling procedures, called labeling functions, to learn without ground-truth labels. In our setting, labeling functions enable domain experts to quickly translate domain knowledge of diverse styles into programmatically generated style annotations. For instance, it is trivial to write programmatic labeling functions for the styles depicted in Figures 1 and 2 (speed and destination). Labeling functions also motivate a new learning objective, which we call programmatic style-consistency: rollouts generated by a policy calibrated for a particular style should return the same style label when fed to the programmatic labeling function.
This notion of style-consistency provides a direct approach to measuring how calibrated a policy is, and does not suffer from the weaknesses of indirect approaches such as mutual information estimation. In the basketball example of scoring when near the basket, trajectories that perform correlated events (like turning towards the basket) will not return the desired style label when fed to the labeling function that checks for scoring events. We elaborate on this in Section 4.

We demonstrate style-calibrated policy learning in the Basketball and MuJoCo domains. Our experiments highlight the modularity of our approach: we can plug in any policy class and any imitation learning algorithm, and reliably optimize for style-consistency using the approach of Section 5. The resulting learned policies can achieve very fine-grained and diverse style-calibration with negligible degradation in imitation quality; for example, our learned policy is calibrated to $4^5$ (1024) distinct style combinations in Basketball.

2. Related Work

Our work combines ideas from policy learning and data programming to develop a weakly supervised approach for more explicit and fine-grained calibration. As such, our work is related to learning disentangled representations and controllable generative modeling, reviewed below.

Imitation learning of diverse behaviors has focused on unsupervised approaches to infer latent variables/codes that capture behavior styles (Li et al., 2017; Hausman et al., 2017; Wang et al., 2017). Similar approaches have also been studied for generating text conditioned on attributes such as sentiment or tense (Hu et al., 2017). A typical strategy is to maximize the mutual information between the latent codes and trajectories, in contrast to our notion of programmatic style-consistency.

Disentangled representation learning aims to learn representations in which each latent dimension corresponds to exactly one desired factor of variation (Bengio et al., 2012). Recent studies (Locatello et al., 2019) have noted that popular techniques (Chen et al., 2016b; Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018) can be sensitive to hyperparameters and that evaluation metrics can be correlated with certain model classes and datasets, which suggests that fully unsupervised learning approaches may, in general, be unreliable for discovering cleanly calibratable representations. We avoid this roadblock by relying on programmatic labeling functions to provide weak supervision.

Conditional generation for images has recently focused on attribute manipulation (Bao et al., 2017; Creswell et al., 2017; Klys et al., 2018), which aims to enforce that changing a label affects only one aspect of the image (similar to disentangled representation learning). We extend these models and compare with our approach in Section 6. Our experiments suggest that these algorithms do not necessarily scale well to sequential domains.

Enforcing consistency in generative modeling, such as cycle-consistency in image generation (Zhu et al., 2017) and self-consistency in hierarchical reinforcement learning (Co-Reyes et al., 2018), has proved beneficial. The former minimizes a discriminative disagreement, whereas the latter minimizes a distributional disagreement between two sets of generated behaviors (e.g., KL-divergence).
From this perspective, our style-consistency notion is more similar to the former; however, we also enforce consistency over multiple time-steps, which is more similar to the latter.

Goal-conditioned policy learning considers policies that take as input the current state along with a desired goal state (e.g., a location), and then must execute a sequence of actions to achieve the goal states. In some cases, the goal states are provided exogenously (Zheng et al., 2016; Le et al., 2018; Broll et al., 2019; Ding et al., 2019), and in other cases the goal states are learned as part of a hierarchical policy learning approach (Co-Reyes et al., 2018; Sharma et al., 2020) in a way that uses a self-consistency metric similar to our style-consistency approach. Our approach can be viewed as complementary to these approaches, as our goal is to study more general notions of consistency (e.g., our styles subsume goals as a special case) as well as to scale to combinatorial joint style spaces.

Hierarchical control via learning latent motor dynamics is concerned with recovering a latent representation of motor control dynamics such that one can easily design controllers in the latent space (which then get decoded into actions). The high-level controllers can then be designed afterwards in a pipelined workflow (Losey et al., 2020; Ling et al., 2020; Luo et al., 2020). These controllers are effective for short time horizons and focus on finding good representations of complex dynamics, whereas we focus on controlling behavior styles that can span longer horizons.

3. Background: Imitation Learning for Behavior Trajectories

Since our focus is on learning style-calibratable generative policies, for simplicity we develop our approach with the basic imitation learning paradigm of behavioral cloning. Interesting future directions include composing our approach with more advanced imitation learning approaches like DAGGER (Ross et al., 2011) and GAIL (Ho & Ermon, 2016), as well as with reinforcement learning.

Notation. Let $\mathcal{S}$ and $\mathcal{A}$ denote the environment state and action spaces. At each timestep $t$, an agent observes state $s_t \in \mathcal{S}$ and executes action $a_t \in \mathcal{A}$ using a policy $\pi : \mathcal{S} \to \mathcal{A}$. The environment then transitions to the next state $s_{t+1}$ according to a (typically unknown) dynamics function $f : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. For the rest of this paper, we assume $f$ is deterministic; a modification of our approach for stochastic $f$ is included in Appendix B. A trajectory $\tau$ is a sequence of $T$ state-action pairs and the last state: $\tau = \{(s_t, a_t)\}_{t=1}^{T} \cup \{s_{T+1}\}$. Let $\mathcal{D}$ be a set of $N$ trajectories collected from expert demonstrations. In our experiments, each trajectory in $\mathcal{D}$ has the same length $T$, but in general this does not need to be the case.

Learning objective. We begin with the basic imitation learning paradigm of behavioral cloning (Syed & Schapire, 2008). The goal is to learn a policy that behaves like the pre-collected demonstrations:

$$\min_{\pi} \; \mathbb{E}_{\tau \sim \mathcal{D}} \big[ \mathcal{L}_{\text{imitation}}(\tau, \pi) \big], \quad (1)$$

where $\mathcal{L}_{\text{imitation}}$ is a loss function that quantifies the mismatch between actions chosen by $\pi$ and those in the demonstrations. Since we are primarily interested in probabilistic or generative policies, we typically use (variants of) the negative log-density $\mathcal{L}_{\text{imitation}}(\tau, \pi) = -\sum_{t=1}^{T} \log \pi(a_t \mid s_t)$, where $\pi(a_t \mid s_t)$ is the probability of picking action $a_t$ in state $s_t$.
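To ground the objective above, the following is a minimal sketch of the negative log-density loss in (1) for a diagonal-Gaussian policy over batches of fixed-length trajectories. The network architecture and names (e.g., `GaussianPolicy`, `hidden_dim`) are illustrative assumptions and not the paper's implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a_t | s_t) as a diagonal Gaussian; a minimal illustrative policy class."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden_dim, action_dim)
        self.log_sigma = nn.Linear(hidden_dim, action_dim)

    def forward(self, states):                      # states: (batch, T, state_dim)
        h = self.body(states)
        return torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())

def imitation_loss(policy, states, actions):
    """L_imitation(tau, pi) = -sum_t log pi(a_t | s_t), averaged over a batch."""
    dist = policy(states)                           # factorized over time and action dims
    log_prob = dist.log_prob(actions).sum(dim=-1)   # (batch, T): per-step log-density
    return -log_prob.sum(dim=-1).mean()             # sum over T, average over the batch
```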
Policy class of $\pi$. Common model choices for instantiating $\pi$ include sequential generative models such as recurrent neural networks (RNNs) and trajectory variational autoencoders (TVAEs). TVAEs introduce a latent variable $z$ (also called a trajectory embedding), an encoder network $q_\phi$, a policy decoder $\pi$, and a prior distribution $p(z)$. They have been shown to work well in a range of generative policy learning settings (Wang et al., 2017; Ha & Eck, 2018; Co-Reyes et al., 2018), and have the following imitation learning objective:

$$\mathcal{L}_{\text{tvae}}(\tau, \pi; q_\phi) = -\mathbb{E}_{q_\phi(z \mid \tau)} \Big[ \sum_{t=1}^{T} \log \pi(a_t \mid s_t, z) \Big] + D_{\mathrm{KL}}\big( q_\phi(z \mid \tau) \,\|\, p(z) \big). \quad (2)$$

The first term in (2) is the standard negative log-density that the policy assigns to trajectories in the dataset, while the second term is the KL-divergence between the prior and the approximate posterior of trajectory embeddings $z$.

The main shortcoming of TVAEs and related approaches, which we address in Sections 4 and 5, is that the resulting policies cannot be easily calibrated to generate specific styles. For instance, the goal of the trajectory embedding $z$ is to capture all the styles that exist in the expert demonstrations, but there is no guarantee that the embeddings cleanly encode the desired styles in a calibrated way. Previous work has largely relied on unsupervised learning techniques that either require significant domain knowledge (Le et al., 2017b) or have trouble scaling to the complex styles commonly found in real-world applications (Wang et al., 2017; Li et al., 2017).

4. Programmatic Style-consistency

Building upon the basic setup in Section 3, we focus on the setting where the demonstrations $\mathcal{D}$ contain diverse behavior styles. To start, let $y \in \mathcal{Y}$ denote a single style label (e.g., speed or destination, as shown in Figure 1). Our goal is to learn a policy that can be explicitly calibrated to $y$, i.e., trajectories generated by $\pi(\cdot \mid y)$ should match the demonstrations in $\mathcal{D}$ that exhibit style $y$. Obtaining style labels can be expensive using conventional annotation methods, and unreliable using unsupervised approaches. We instead utilize easily programmable labeling functions that automatically produce style labels. We then formalize a notion of style-consistency as a learning objective, and in Section 5 describe a practical learning approach.

Labeling functions. Introduced in the data programming paradigm (Ratner et al., 2016), labeling functions programmatically produce weak and noisy labels to learn models on otherwise unlabeled datasets. A significant benefit is that labeling functions are often simple scripts that can be quickly applied to the dataset, which is much cheaper than manual annotation and more reliable than unsupervised methods. In our framework, we study behavior styles that can be represented as labeling functions, which we denote $\lambda$, mapping trajectories $\tau$ to style labels $y$. For example:

$$\lambda(\tau) = \mathbb{1}\{ \| s_{T+1} - s_1 \|_2 > c \}, \quad (3)$$

which distinguishes between trajectories with large (greater than a threshold $c$) versus small total displacement. We experiment with a range of labeling functions, as described in Section 6. Many behavior styles used in previous work can be represented as labeling functions, e.g., agent speed (Wang et al., 2017). Multiple labeling functions can be provided at once, resulting in a combinatorial space of joint style labels. We use trajectory-level labels $\lambda(\tau)$ in our experiments, but in general labeling functions can be applied to subsequences $\lambda(\tau_{t:t+h})$ to obtain per-timestep labels, e.g., agent goal (Broll et al., 2019).
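As a concrete illustration of (3), the snippet below implements a displacement labeling function over the states of one trajectory; the array layout and the threshold value are illustrative assumptions rather than the authors' exact code.

```python
import numpy as np

def displacement_label(states, threshold=0.5):
    """Labeling function lambda(tau) = 1{ ||s_{T+1} - s_1||_2 > c }.

    states: array of shape (T+1, state_dim) holding the states of one trajectory.
    Returns 1 for large total displacement, 0 otherwise.
    """
    displacement = np.linalg.norm(states[-1] - states[0])
    return int(displacement > threshold)

# Annotating a dataset D = {tau_i} to obtain lambda(D) = {(tau_i, lambda(tau_i))}:
# labeled_data = [(traj, displacement_label(traj["states"])) for traj in dataset]
```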
We can efficiently annotate datasets using labeling functions, which we denote as $\lambda(\mathcal{D}) = \{(\tau_i, \lambda(\tau_i))\}_{i=1}^{N}$. Our goal can now be phrased as: given $\lambda(\mathcal{D})$, train a policy $\pi : \mathcal{S} \times \mathcal{Y} \to \mathcal{A}$ such that $\pi(\cdot \mid y)$ is calibrated to the styles $y$ found in $\lambda(\mathcal{D})$.

Style-consistency. A key insight in our work is that labeling functions naturally induce a metric for calibration. If a policy $\pi(\cdot \mid y)$ is calibrated to $\lambda$, we would expect the generated behaviors to be consistent with the label. So, we expect the following loss to be small:

$$\mathbb{E}_{y \sim p(y),\, \tau \sim \pi(\cdot \mid y)} \big[ \mathcal{L}_{\text{style}}\big(\lambda(\tau), y\big) \big], \quad (4)$$

where $p(y)$ is a prior over the style labels and $\tau$ is obtained by executing the style-conditioned policy in the environment. $\mathcal{L}_{\text{style}}$ is thus a disagreement loss over labels that is minimized when $\lambda(\tau) = y$, e.g., $\mathcal{L}_{\text{style}}(\lambda(\tau), y) = \mathbb{1}\{\lambda(\tau) \neq y\}$ for categorical labels. We refer to (4) as the style-consistency loss, and say that $\pi(\cdot \mid y)$ is maximally calibrated to $\lambda$ when (4) is minimized.

Our learning objective adds (1) and (4):

$$\min_{\pi} \; \mathbb{E}_{\tau \sim \mathcal{D}}\big[\mathcal{L}_{\text{imitation}}(\tau, \pi)\big] + \mathbb{E}_{y \sim p(y),\, \tau \sim \pi(\cdot \mid y)}\big[\mathcal{L}_{\text{style}}(\lambda(\tau), y)\big]. \quad (5)$$

The simplest choice for the prior distribution $p(y)$ is the marginal distribution of styles in $\lambda(\mathcal{D})$. The first term in (5) is a standard imitation learning objective and can be tractably estimated using $\lambda(\mathcal{D})$. To enforce style-consistency with the second term, conceptually we need to sample several $y \sim p(y)$, then several rollouts $\tau \sim \pi(\cdot \mid y)$ from the current policy, and query the labeling function for each of them. Furthermore, if $\lambda$ is a non-differentiable function defined over the entire trajectory, as is the case in (3), then we cannot simply backpropagate the style-consistency loss. In Section 5, we introduce differentiable approximations to more easily optimize the objective in (5).

Combinatorial joint style space. Our notion of style-consistency can be easily extended to optimize for combinatorially many joint styles when multiple labeling functions are provided. Suppose we have $M$ labeling functions $\{\lambda_i\}_{i=1}^{M}$ and corresponding label spaces $\{\mathcal{Y}_i\}_{i=1}^{M}$. Let $\boldsymbol{\lambda}$ denote $(\lambda_1, \dots, \lambda_M)$ and $\mathbf{y}$ denote $(y_1, \dots, y_M)$. The style-consistency loss becomes:

$$\mathbb{E}_{\mathbf{y} \sim p(\mathbf{y}),\, \tau \sim \pi(\cdot \mid \mathbf{y})} \big[ \mathcal{L}_{\text{style}}\big(\boldsymbol{\lambda}(\tau), \mathbf{y}\big) \big]. \quad (6)$$

Note that style-consistency is optimal when the generated trajectory agrees with all labeling functions. Although challenging to achieve, this outcome is the most desirable, i.e., $\pi(\cdot \mid \mathbf{y})$ is calibrated to all styles simultaneously. Indeed, a key metric that we evaluate is how well various learned policies can be calibrated to all styles simultaneously (i.e., a loss of 0 only if all styles are calibrated, and a loss of 1 otherwise).
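For categorical labels, the disagreement losses above reduce to very small functions. The sketch below spells out the 0/1 loss in (4) and the joint version used as the evaluation metric (0 only if all styles are realized, 1 otherwise); the function names are illustrative.

```python
def style_loss(labeling_fn, traj, y):
    """0/1 disagreement loss L_style(lambda(tau), y) for a categorical style label."""
    return float(labeling_fn(traj) != y)

def joint_style_loss(labeling_fns, traj, ys):
    """Joint 0/1 loss over M labeling functions: 0 only if ALL styles agree, else 1."""
    return float(any(lf(traj) != y_i for lf, y_i in zip(labeling_fns, ys)))
```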
5. Learning Approach

Optimizing (5) is challenging due to the long time horizon and the non-differentiability of the labeling functions $\lambda$.[1] Given unlimited queries to the environment, one could naively employ model-free reinforcement learning, e.g., estimating (4) using rollouts and optimizing with policy gradient approaches. We instead take a model-based approach, described generically in Algorithm 1, that is more computationally efficient and decomposable (i.e., transparent). The model-based approach is compatible with batch or offline learning, and we found it particularly useful for diagnosing deficiencies in our algorithmic framework. We first introduce a label approximator for $\lambda$, and then show how to optimize through the environment dynamics using a differentiable model-based learning approach.

[1] This issue is not encountered in previous work on style-dependent imitation learning (Li et al., 2017; Hausman et al., 2017), since they use purely unsupervised methods, such as maximizing mutual information, which are differentiable.

Algorithm 1: Generic recipe for optimizing (5)
1: Input: demonstrations $\mathcal{D}$, labeling functions $\lambda$
2: construct $\lambda(\mathcal{D})$ by applying $\lambda$ to trajectories in $\mathcal{D}$
3: optimize (7) to convergence to learn $C_\lambda$
4: optimize (8) to convergence to learn $\pi$

Approximating labeling functions. To deal with the non-differentiability of $\lambda$, we approximate it with a differentiable function $C_\lambda$ parameterized by $\psi$:

$$\psi^* = \arg\min_{\psi} \; \mathbb{E}_{\tau \sim \mathcal{D}} \big[ \mathcal{L}_{\text{label}}\big( C_{\lambda}(\tau), \lambda(\tau) \big) \big]. \quad (7)$$

Here, $\mathcal{L}_{\text{label}}$ is a differentiable loss that approximates $\mathcal{L}_{\text{style}}$, such as the cross-entropy loss when $\mathcal{L}_{\text{style}}$ is the 0/1 loss. In our experiments we use an RNN to represent $C_\lambda$. We then modify the style-consistency term in (5) with $C_\lambda$ and optimize:

$$\min_{\pi} \; \mathbb{E}_{\tau \sim \mathcal{D}}\big[\mathcal{L}_{\text{imitation}}(\tau, \pi)\big] + \mathbb{E}_{y \sim p(y),\, \tau \sim \pi(\cdot \mid y)}\big[\mathcal{L}_{\text{label}}\big(C_{\lambda}(\tau), y\big)\big]. \quad (8)$$

Optimizing $\mathcal{L}_{\text{style}}$ over trajectories. The next challenge is to optimize style-consistency over multiple time steps. Consider the labeling function in (3) that computes the difference between the first and last states. Our label approximator $C_\lambda$ may converge to a solution that ignores all inputs except for $s_1$ and $s_{T+1}$. In this case, $C_\lambda$ provides no learning signal about intermediate steps. As such, effective optimization of style-consistency in (8) requires informative learning signals on all actions at every step, which can be viewed as a type of credit assignment problem.

In general, model-free and model-based approaches address this challenge in dramatically different ways and for different problem settings. A model-free solution views this credit assignment challenge as analogous to that faced by reinforcement learning (RL) and repurposes generic reinforcement learning algorithms. Crucially, such solutions assume access to the environment to collect more rollouts under any new policy. A model-based solution does not assume such access and can operate with only the batch of behavior data $\mathcal{D}$; however, it has an additional failure mode, since the learned models may provide an inaccurate signal for proper credit assignment. We choose a model-based approach, while exploiting access to the environment when available to refine the learned models, for two reasons: (a) we found it to be compositionally simpler and easier to debug; and (b) we can use the learned model to obtain hallucinated rollouts of any policy efficiently during training.

Modeling dynamics for credit assignment. Our model-based approach utilizes a dynamics model $M_\varphi$ to approximate the environment's dynamics by predicting the change in state given the current state and action:

$$\varphi^* = \arg\min_{\varphi} \; \mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \sum_{t=1}^{T} \mathcal{L}_{\text{dynamics}}\big( M_{\varphi}(s_t, a_t),\; s_{t+1} - s_t \big) \Big], \quad (9)$$

where $\mathcal{L}_{\text{dynamics}}$ is often the L2 or squared-L2 loss (Nagabandi et al., 2018; Luo et al., 2019). This allows us to generate trajectories by rolling out: $s_{t+1} = s_t + M_{\varphi}(s_t, a_t)$. Optimizing for style-consistency in (8) then backpropagates through our dynamics model $M_{\varphi}$ and provides informative learning signals to the policy at every timestep.
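The crux of this step is that rollouts generated with the learned dynamics model remain differentiable functions of the policy parameters, so the style-consistency term in (8) can send gradients to the policy at every timestep. Below is a minimal PyTorch-style sketch of the idea; the module shapes, the reparameterized Gaussian action sampling, the style-conditioning interface of `policy`, and the classifier interface of `label_approx` are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """M_phi(s_t, a_t): predicts the change in state (s_{t+1} - s_t), as in Eq. (9)."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def hallucinated_rollout(policy, dynamics, s0, y, horizon):
    """Roll out the style-conditioned policy pi(.|y) through M_phi.

    policy(state, y) is assumed to return a torch distribution with rsample(),
    so every state below stays differentiable with respect to the policy."""
    states, state = [s0], s0
    for _ in range(horizon):
        action = policy(state, y).rsample()       # reparameterized sample keeps gradients
        state = state + dynamics(state, action)   # s_{t+1} = s_t + M_phi(s_t, a_t)
        states.append(state)
    return torch.stack(states, dim=1)             # (batch, T+1, state_dim)

def style_consistency_term(policy, dynamics, label_approx, s0, y, horizon):
    """Differentiable surrogate for the second term of Eq. (8):
    L_label(C_lambda(tau), y) evaluated on hallucinated rollouts."""
    traj = hallucinated_rollout(policy, dynamics, s0, y, horizon)
    logits = label_approx(traj)                   # C_lambda, e.g., a GRU classifier
    return nn.functional.cross_entropy(logits, y) # y: class indices of shape (batch,)
```

Because the state update is just an addition of the model's predicted delta, gradients flow from the classifier output back through every intermediate state and action, which is exactly the per-timestep learning signal discussed above.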
We outline our model-based approach in Algorithm 2. Lines 12-15 describe an optional step to fine-tune the dynamics model by querying the environment with the current policy (similar to Luo et al. (2019)); we found that this can improve style-consistency in some experiments. In Appendix B we elaborate on how the dynamics model and the objective in (9) change if the environment is stochastic.

Algorithm 2: Model-based approach for Algorithm 1
1: Input: demonstrations $\mathcal{D}$, labeling function $\lambda$, label approximator $C_\lambda$, dynamics model $M_\varphi$
2: construct $\lambda(\mathcal{D}) = \{(\tau_i, \lambda(\tau_i))\}_{i=1}^{N}$
3: for $n_{\text{dynamics}}$ iterations do
4:   optimize (9) with a batch from $\mathcal{D}$
5: end for
6: for $n_{\text{label}}$ iterations do
7:   optimize (7) with a batch from $\lambda(\mathcal{D})$
8: end for
9: for $n_{\text{policy}}$ iterations do
10:   $\mathcal{B} \leftarrow \{ n_{\text{collect}}$ trajectories collected using $M_\varphi$ and $\pi \}$
11:   optimize (8) with a batch from $\lambda(\mathcal{D})$ and $\mathcal{B}$
12:   for $n_{\text{env}}$ iterations do
13:     $\tau_{\text{env}} \leftarrow \{ 1$ trajectory collected using the environment and $\pi \}$
14:     optimize (9) with $\tau_{\text{env}}$
15:   end for
16: end for

Discussion. To summarize, we claim that style-consistency is an objective metric for measuring the quality of calibration. Our learning approach uses off-the-shelf methods to enforce style-consistency during training. We anticipate several variants of the style-consistent policy learning recipe in Algorithm 1, for example: using model-free RL, using environment/model rollouts to fine-tune the labeling function approximator, using style-conditioned policy classes, or using other loss functions to encourage imitation quality. Our experiments in Section 6 establish that our style-consistency loss provides a clear learning signal, that no prior approach directly enforces this consistency, and that our approach accomplishes calibration for a combinatorial joint style space.

6. Experiments

We first briefly describe our experimental setup and baseline choices, and then discuss our main experimental results. A full description of the experiments is available in Appendix C.[2]

[2] Code is available at: https://github.com/ezhan94/calibratable-style-consistency.

Data. We validate our framework on two datasets: 1) a collection of professional basketball player trajectories, with the goal of learning a policy that generates realistic player movement, and 2) a Cheetah agent running horizontally in MuJoCo (Todorov et al., 2012), with the goal of learning a policy with calibrated gaits. The former has a known dynamics function, $f(s_t, a_t) = s_t + a_t$, where $s_t$ and $a_t$ are the player's position and velocity on the court, respectively; we expect the dynamics model $M_\varphi$ to easily recover this function. The latter has an unknown dynamics function (of which we learn a model when approximating style-consistency). We obtain Cheetah demonstrations from a collection of policies trained using pytorch-a2c-ppo-acktr (Kostrikov, 2018) to interface with the DeepMind Control Suite's Cheetah domain (Tassa et al., 2018); see Appendix C for details.

Labeling functions. Labeling functions for Basketball include: 1) average SPEED of the player, 2) DISPLACEMENT from initial to final position, 3) distance from final position to a fixed DESTINATION on the court (e.g., the basket), 4) mean DIRECTION of travel, and 5) CURVATURE of the trajectory, which measures the player's propensity to change directions. For Cheetah, we have labeling functions for the agent's 1) SPEED, 2) TORSO HEIGHT, 3) BACK-FOOT HEIGHT, and 4) FRONT-FOOT HEIGHT, which can be trivially computed from trajectories extracted from the environment. We threshold the aforementioned labeling functions into categorical labels (leaving real-valued labels for future work) and use (4) for style-consistency with $\mathcal{L}_{\text{style}}$ as the 0/1 loss. We use cross-entropy for $\mathcal{L}_{\text{label}}$ and list all other hyperparameters in Appendix C.
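The thresholding step can be as simple as binning a real-valued labeling-function output into classes. The sketch below shows one way to do this for an average-speed style; the bin edges are illustrative and not the thresholds used in the paper.

```python
import numpy as np

def average_speed(states):
    """Real-valued labeling function: mean per-step displacement of the agent."""
    return float(np.mean(np.linalg.norm(np.diff(states, axis=0), axis=-1)))

def speed_class(states, thresholds=(0.1, 0.3)):
    """Threshold the SPEED labeling function into len(thresholds)+1 categorical classes."""
    return int(np.digitize(average_speed(states), thresholds))
```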
Metrics. We primarily study two properties of the learned models in our experiments: imitation quality and style-calibration quality. For measuring imitation quality of generative models, we report the negative log-density term in (2), also known as the reconstruction loss term in the VAE literature (Kingma & Welling, 2014; Ha & Eck, 2018), which corresponds to how well the policy can reconstruct trajectories from the dataset. To measure style-calibration, we report style-consistency results as $1 - \mathcal{L}_{\text{style}}$ in (4), so that all results are easily interpreted as accuracies. In Section 6.5, we find that style-consistency indeed captures a reasonable notion of calibration: when the labeling function is inherently noisy and style-calibration is hard, style-consistency correspondingly decreases. In Section 6.3, we find that the goals of imitation (as measured by negative log-density) and calibration (as measured by style-consistency) may not always be aligned; investigating this trade-off is an avenue for future work.

Baselines. Our main experiments use TVAEs as the underlying policy class. In Section 6.4, we also experiment with an RNN policy class. We compare our approach, CTVAE-style, with 3 baselines:

1. CTVAE: conditional TVAEs (Wang et al., 2017).
2. CTVAE-info: CTVAE with information factorization (Creswell et al., 2017), which indirectly maximizes style-consistency by removing the correlation of y with z.
3. CTVAE-mi: CTVAE with mutual information maximization between style labels and trajectories. This is a supervised variant of unsupervised models (Chen et al., 2016b; Li et al., 2017), and also requires learning a dynamics model for sampling policy rollouts.

Detailed descriptions of the baselines are in Appendix A. All baseline models build upon TVAEs, which are also conditioned on a latent variable (see Section 3), and fundamentally differ only in how they encourage the calibration of policies to different style labels. We highlight that the underlying model choice is orthogonal to our contributions; our framework is compatible with other policy models (see Section 6.4).

Model details. We model all trajectory embeddings $z$ as a diagonal Gaussian with a standard normal prior. The encoder $q_\phi$ and label approximators $C_\lambda$ are bi-directional GRUs (Cho et al., 2014) followed by linear layers. The policy $\pi$ is recurrent for Basketball, but non-recurrent for Cheetah. The Gaussian log-sigma returned by $\pi$ is state-dependent for Basketball, but state-independent for Cheetah. For Cheetah, we made these choices based on prior work on training gait policies in MuJoCo (Wang et al., 2017). For Basketball, we observed much more variation in the 500k demonstrations, so we experimented with a more flexible model. See Appendix C for hyperparameters.
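The reported style-consistency numbers are therefore Monte-Carlo accuracies over rollouts executed in the environment. A minimal sketch of this evaluation follows; the `rollout` helper and the `label_prior` interface are hypothetical, and the default rollout count mirrors the 4,000 Basketball rollouts used below.

```python
import numpy as np

def style_consistency(policy, env, labeling_fn, label_prior, num_rollouts=4000):
    """Reported metric: 1 - E[L_style], i.e., the fraction of rollouts whose
    labeling-function output matches the style label the policy was conditioned on."""
    hits = []
    for _ in range(num_rollouts):
        y = label_prior.sample()               # y ~ p(y), the marginal over lambda(D)
        traj = rollout(policy, env, style=y)   # execute pi(.|y) in the environment
        hits.append(labeling_fn(traj) == y)
    return float(np.mean(hits))
```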
6.1. How well can we calibrate policies for single styles?

We first threshold labeling functions into 3 classes for Basketball and 2 classes for Cheetah; the marginal distribution $p(y)$ of styles in $\lambda(\mathcal{D})$ is roughly uniform over these classes. Then we learn a policy calibrated to each of these styles. Finally, we generate rollouts from each of the learned policies to measure style-consistency.

Table 1 compares the median style-consistency (over 5 seeds) of the learned policies. For Basketball, CTVAE-style significantly outperforms the baselines and achieves almost perfect style-consistency for 4 of the 5 styles. For Cheetah, CTVAE-style outperforms all baselines, but the absolute performance is lower than for Basketball; we conjecture that this is due to the complex environment dynamics, which can be challenging for model-based approaches. Figure 5 in Appendix D shows a visualization of our CTVAE-style policy calibrated for DESTINATION (net).

Table 1. Individual style calibration: style-consistency (x 10^-2, median over 5 seeds) of policies evaluated with 4,000 Basketball and 500 Cheetah rollouts. Trained separately for each style, CTVAE-style policies outperform the baselines for all styles in the Basketball and Cheetah environments.

(a) Style-consistency for labeling functions in Basketball.

Model       | Speed | Disp. | Dest. | Dir. | Curve
CTVAE       | 83    | 72    | 82    | 77   | 61
CTVAE-info  | 84    | 71    | 79    | 72   | 60
CTVAE-mi    | 86    | 74    | 82    | 77   | 72
CTVAE-style | 95    | 96    | 97    | 97   | 81

(b) Style-consistency for labeling functions in Cheetah.

Model       | Speed | Torso | BFoot | FFoot
CTVAE       | 59    | 63    | 68    | 68
CTVAE-info  | 57    | 63    | 65    | 66
CTVAE-mi    | 60    | 65    | 65    | 70
CTVAE-style | 79    | 80    | 80    | 77

We also consider cases in which labeling functions can have several classes and non-uniform distributions (i.e., some styles are more or less common than others). We threshold DISPLACEMENT into 6 classes for Basketball and SPEED into 4 classes for Cheetah, and compare the policies in Table 2. In general, we observe degradation in overall style-consistency accuracies as the number of classes increases. However, CTVAE-style policies still consistently achieve better style-consistency than the baselines in this setting.

Table 2. Fine-grained style-consistency (x 10^-2, median over 5 seeds): training on labeling functions with more classes (DISPLACEMENT for Basketball, SPEED for Cheetah) yields increasingly fine-grained calibration of behavior. Columns group Basketball (left) and Cheetah (right) settings, with the number of label classes increasing from left to right within each group (up to 6 classes for Basketball and 4 classes for Cheetah). Although CTVAE-style degrades as the number of classes increases, it outperforms the baselines for all styles.

Model       | Basketball      | Cheetah
CTVAE       | 92  83  79  70  | 45  37
CTVAE-info  | 90  83  78  70  | 49  39
CTVAE-mi    | 92  84  77  70  | 48  37
CTVAE-style | 99  98  96  92  | 59  51

We visualize and compare policies calibrated for 6 classes of DISPLACEMENT in Figure 2. In Figures 2b and 2d, we see that our CTVAE-style policy (0.92 style-consistency) is effectively calibrated for the styles of low and high displacement, as all trajectories end in the correct corresponding regions (marked by green dots). On the other hand, trajectories from a baseline CTVAE model (0.70 style-consistency) in Figures 2a and 2c can sometimes end in a wrong region corresponding to a different style label (marked by red dots). These results suggest that incorporating programmatic style-consistency while training via (8) can yield good qualitative and quantitative calibration results.

6.2. Can we calibrate for combinatorial joint style spaces?

We now consider combinatorial style-consistency as in (6), which measures style-consistency with respect to all labeling functions simultaneously. For instance, in Figure 3, we calibrate to both terminating close to the net and the speed at which the agent moves towards the target destination; if either style is not calibrated, then the joint style is not calibrated. In our experiments, we evaluated up to 1024 joint styles.

Figure 3. CTVAE-style rollouts calibrated for 2 styles: label class 1 of DESTINATION (net) (see Figure 5 in Appendix D) and each class of SPEED, with 0.93 style-consistency. Diamonds and dots indicate initial and final positions. Panels: (a) label class 0 (slow); (b) label class 1 (mid); (c) label class 2 (fast).
Table 3 compares the style-consistency of policies simultaneously calibrated for up to 5 labeling functions for Basketball and 3 labeling functions for Cheetah. This is a very difficult task, and we see that style-consistency for the baselines degrades significantly as the number of joint styles grows combinatorially. On the other hand, our CTVAE-style approach experiences only a modest decrease in style-consistency and is still significantly better calibrated (0.55 style-consistency vs. 0.21 best-baseline style-consistency in the most challenging experiment for Basketball). We visualize a CTVAE-style policy calibrated for two styles in Basketball with style-consistency 0.93 in Figure 3. CTVAE-style outperforms the baselines in Cheetah as well, but there is still room for improvement in optimizing style-consistency in future work.

Table 3. Combinatorial style-consistency (x 10^-2, median over 5 seeds). Simultaneously calibrated to joint styles from multiple labeling functions, CTVAE-style policies significantly outperform all baselines. The number of distinct style combinations is given in brackets. The most challenging experiment for Basketball calibrates for 1024 joint styles (5 labeling functions, 4 classes each), in which CTVAE-style achieves a +161% improvement in style-consistency over the best baseline.

(a) Style-consistency for labeling functions in Basketball.

Model       | 2 style, 3 class (8) | 3 style, 3 class (27) | 4 style, 3 class (81) | 5 style, 3 class (243) | 5 style, 4 class (1024)
CTVAE       | 71 | 58 | 50 | 37 | 21
CTVAE-info  | 69 | 58 | 51 | 32 | 21
CTVAE-mi    | 72 | 56 | 51 | 30 | 21
CTVAE-style | 93 | 88 | 88 | 75 | 55

(b) Style-consistency for labeling functions in Cheetah.

Model       | 2 style, 2 class (4) | 3 style, 2 class (8)
CTVAE       | 41 | 28
CTVAE-info  | 41 | 27
CTVAE-mi    | 40 | 28
CTVAE-style | 54 | 40

6.3. What is the trade-off between style-consistency and imitation quality?

In Table 4, we investigate whether CTVAE-style's superior style-consistency comes at a significant cost to imitation quality, since we optimize both in (5). For Basketball, high style-consistency is achieved without any degradation in imitation quality. For Cheetah, negative log-density is slightly worse; a follow-up experiment (Table 13 in Appendix D) shows that we can improve imitation quality with more training, sometimes with a modest decrease in style-consistency.

Table 4. KL-divergence and negative log-density per timestep for TVAE models (lower is better). CTVAE-style is comparable to the baselines for Basketball, but is slightly worse for Cheetah.

Model       | Basketball D_KL | Basketball NLD | Cheetah D_KL | Cheetah NLD
TVAE        | 2.55 | -7.91 | 29.4 | -0.60
CTVAE       | 2.51 | -7.94 | 29.3 | -0.59
CTVAE-info  | 2.25 | -7.91 | 29.1 | -0.58
CTVAE-mi    | 2.56 | -7.94 | 28.5 | -0.57
CTVAE-style | 2.27 | -7.83 | 30.1 | -0.28

6.4. Is our framework compatible with other policy classes for imitation?

We highlight that our framework introduced in Section 5 is compatible with any policy class. In this experiment, we optimize for style-consistency using a simpler model for the policy and show that style-consistency is still improved. In particular, we use an RNN and calibrate for DESTINATION in Basketball. In Table 5, we see that style-consistency is improved for the RNN model without any significant decrease in imitation quality.

Table 5. Style-consistency (x 10^-2, 5 seeds) of an RNN policy model for DESTINATION in Basketball. Our approach improves style-consistency without significantly decreasing imitation quality. Higher style-consistency and lower NLD are better.

Model     | Style-consistency: Min | Median | Max | NLD
RNN       | 79 | 80 | 81 | -7.7
RNN-style | 81 | 91 | 98 | -7.6

6.5. What if labeling functions are noisy?

So far, we have demonstrated that our method of optimizing for style-consistency directly can learn policies that are much better calibrated to styles, without a significant degradation in imitation quality. However, the labeling functions used thus far are assumed to be perfect, in that they capture exactly the style that we wish to calibrate. In practice, domain experts may specify labeling functions that are noisy; we simulate that scenario in this experiment.

In particular, we create noisy versions of the labeling functions in Table 1 by adding Gaussian noise to the computed values before applying the thresholds. The noise results in some label disagreement between the noisy and true labeling functions (Table 17 in Appendix D). This resembles the scenario in practice where domain experts can mislabel a trajectory, or have disagreements. We train CTVAE-style models with the noisy labeling functions and compute style-consistency using the true labeling functions without noise. Intuitively, we expect the relative decrease in style-consistency to scale linearly with the label disagreement.
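One way to simulate such noise, sketched below, is to perturb the real-valued output of a labeling function with zero-mean Gaussian noise of standard deviation c·σ before thresholding; the interface and parameter names are illustrative.

```python
import numpy as np

def noisy_label(raw_value, thresholds, sigma, c=1, rng=None):
    """Noisy labeling function: perturb the raw labeling-function value with
    zero-mean Gaussian noise of standard deviation c * sigma, then threshold.

    raw_value: output of a real-valued labeling function (e.g., average speed).
    sigma: standard deviation of that labeling function over the dataset.
    c: noise multiplier, e.g., c in {1, 2, 3, 4} as in this experiment.
    """
    rng = rng or np.random.default_rng()
    noisy_value = raw_value + rng.normal(0.0, c * sigma)
    return int(np.digitize(noisy_value, thresholds))
```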
Figure 4. Relative change in style-consistency for CTVAE-style policies trained with noisy labeling functions, which are created by injecting noise with mean 0 and standard deviation c·σ for c ∈ {1, 2, 3, 4} before applying thresholds to obtain label classes. The x-axis is the label disagreement between the noisy and true labeling functions. The y-axis is the median change (5 seeds) in style-consistency measured with the true labeling functions without noise, relative to Table 1. The relationship is generally linear and better than a one-to-one dependency (i.e., X% label disagreement leading to X% relative change, indicated by the black line). See Tables 17 and 18 in Appendix D for more details.

Figure 4 shows that the median relative decrease in style-consistency of our CTVAE-style models scales linearly with label disagreement. Our method is also somewhat robust to noise, as X% label disagreement results in better than X% relative decrease in style-consistency (black line in Figure 4). Directions for future work include combining multiple noisy labeling functions to improve style-consistency with respect to a true labeling function.

7. Conclusion and Future Work

We propose a novel framework for imitating diverse behavior styles while also calibrating to desired styles. Our framework leverages labeling functions to tractably represent styles and introduces programmatic style-consistency, a metric that allows for fair comparison between calibrated policies. Our experiments demonstrate strong empirical calibration results. We believe that our framework lays the foundation for many directions of future research. First, can one model more complex styles not easily captured with a single labeling function (e.g., aggressive vs. passive play in sports) by composing simpler labeling functions (e.g., max speed, distance to closest opponent, number of fouls committed, etc.), similar to (Ratner et al., 2016; Bach et al., 2017)? Second, can we use per-timestep labels to model transient styles, or to simplify the credit assignment problem when learning to calibrate? Third, can we blend our programmatic supervision with unsupervised learning approaches to arrive at effective semi-supervised solutions? Fourth, can we use model-free approaches to further optimize style-consistency, e.g., to fine-tune from our model-based approach? Finally, can we integrate our framework with reinforcement learning to also optimize for environmental rewards?
Acknowledgements

This research is supported in part by NSF #1564330, NSF #1918655, DARPA PAI, and gifts from Intel, Activision/Blizzard and Northrop Grumman. The Basketball dataset was provided by STATS.

References

Bach, S. H., He, B. D., Ratner, A., and Ré, C. Learning the structure of generative models without labeled data. In International Conference on Machine Learning (ICML), 2017.

Bao, J., Chen, D., Wen, F., Li, H., and Hua, G. CVAE-GAN: Fine-grained image generation through asymmetric training. In IEEE International Conference on Computer Vision (ICCV), 2017.

Bengio, Y., Courville, A. C., and Vincent, P. Unsupervised feature learning and deep learning: A review and new perspectives. arXiv preprint arXiv:1206.5538, 2012.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Branson, K., Robie, A. A., Bender, J., Perona, P., and Dickinson, M. H. High-throughput ethomics in large groups of drosophila. Nature Methods, 6(6):451, 2009.

Broll, B., Hausknecht, M., Bignell, D., and Swaminathan, A. Customizing scripted bots: Sample efficient imitation learning for human-like behavior in Minecraft. AAMAS Workshop on Adaptive and Learning Agents, 2019.

Chang, M.-F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al. Argoverse: 3D tracking and forecasting with rich maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Chen, J., Le, H. M., Carr, P., Yue, Y., and Little, J. J. Learning online smooth predictors for realtime camera planning using recurrent decision trees. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4688-4696, 2016a.

Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Neural Information Processing Systems (NeurIPS), 2018.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems (NeurIPS), 2016b.

Cho, K., van Merriënboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and Levine, S. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. In International Conference on Machine Learning (ICML), 2018.

Creswell, A., Bharath, A. A., and Sengupta, B. Adversarial information factorization. arXiv preprint arXiv:1711.05175, 2017.

Ding, Y., Florensa, C., Abbeel, P., and Phielipp, M. Goal-conditioned imitation learning. In Neural Information Processing Systems (NeurIPS), 2019.

Eyjolfsdottir, E., Branson, S., Burgos-Artizzu, X. P., Hoopfer, E. D., Schor, J., Anderson, D. J., and Perona, P. Detecting social actions of fruit flies. In European Conference on Computer Vision (ECCV), pp. 772-787. Springer, 2014.

Eyjolfsdottir, E., Branson, K., Yue, Y., and Perona, P. Learning recurrent representations for hierarchical behavior modeling. In International Conference on Learning Representations (ICLR), 2017.

Ha, D. and Eck, D. A neural representation of sketch drawings. In International Conference on Learning Representations (ICLR), 2018.

Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G. S., and Lim, J. J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Neural Information Processing Systems (NeurIPS), 2017.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Neural Information Processing Systems (NeurIPS), pp. 4565-4573, 2016.

Hofmann, K. Minecraft as AI playground and laboratory. In Proceedings of the Annual Symposium on Computer-Human Interaction in Play, pp. 1-1, 2019.

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. Toward controlled generation of text. In International Conference on Machine Learning (ICML), 2017.

Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems (NeurIPS), 2016.

Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning (ICML), 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

Klys, J., Snell, J., and Zemel, R. S. Learning latent subspaces in variational autoencoders. In Neural Information Processing Systems (NeurIPS), 2018.

Kostrikov, I. PyTorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.

Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., and Leibe, B. The Atari grand challenge dataset. arXiv preprint arXiv:1705.10998, 2017.

Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., and Ranzato, M. Fader networks: Manipulating images by sliding attributes. In Neural Information Processing Systems (NeurIPS), 2017.

Le, H. M., Carr, P., Yue, Y., and Lucey, P. Data-driven ghosting using deep imitation learning. In MIT Sloan Sports Analytics Conference (SSAC), 2017a.

Le, H. M., Yue, Y., Carr, P., and Lucey, P. Coordinated multi-agent imitation learning. In International Conference on Machine Learning (ICML), 2017b.

Le, H. M., Jiang, N., Agarwal, A., Dudík, M., Yue, Y., and Daumé III, H. Hierarchical imitation and reinforcement learning. In International Conference on Machine Learning (ICML), 2018.

Li, Y., Song, J., and Ermon, S. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Neural Information Processing Systems (NeurIPS), 2017.

Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations (ICLR), 2018.

Ling, H. Y., Zinno, F., Cheng, G., and van de Panne, M. Character controllers using motion VAEs. In ACM Conference on Graphics (SIGGRAPH), 2020.

Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning (ICML), 2019.

Losey, D. P., Srinivasan, K., Mandlekar, A., Garg, A., and Sadigh, D. Controlling assistive robots with learned latent actions. In International Conference on Robotics and Automation (ICRA), 2020.

Luo, W., Yang, B., and Urtasun, R. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations (ICLR), 2019.

Luo, Y.-S., Soeseno, J. H., Chen, T. P.-C., and Chen, W.-C. CARL: Controllable agent with reinforcement learning for quadruped locomotion. In ACM Conference on Graphics (SIGGRAPH), 2020.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation (ICRA), 2018.

Nishimura, T., Hashimoto, A., Yamakata, Y., and Mori, S. Frame selection for producing recipe with pictures from an execution video of a recipe. In Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities, pp. 9-16. ACM, 2019.

Ratner, A., Sa, C. D., Wu, S., Selsam, D., and Ré, C. Data programming: Creating large training sets, quickly. In Neural Information Processing Systems (NeurIPS), 2016.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 627-635, 2011.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharma, A., Gu, S., Levine, S., Kumar, V., and Hausman, K. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations (ICLR), 2020.

Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.

Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Neural Information Processing Systems (NeurIPS), pp. 1449-1456, 2008.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T. P., and Riedmiller, M. A. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A. G., Hodgins, J., and Matthews, I. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):93, 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., and Heess, N. Robust imitation of diverse behaviors. In Neural Information Processing Systems (NeurIPS), 2017.

Yeh, R. A., Schwing, A. G., Huang, J., and Murphy, K. Diverse generation for multi-agent sports games. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Zhan, E., Zheng, S., Yue, Y., Sha, L., and Lucey, P. Generating multi-agent trajectories using programmatic weak supervision. In International Conference on Learning Representations (ICLR), 2019.

Zheng, S., Yue, Y., and Hobbs, J. Generating long-term trajectories using deep hierarchical networks. In Neural Information Processing Systems (NeurIPS), 2016.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pp. 2223-2232, 2017.