UNSUPERVISED PERCEPTUAL REWARDS FOR IMITATION LEARNING

Pierre Sermanet, Kelvin Xu & Sergey Levine
Google Brain
{sermanet,kelvinxx,slevine}@google.com

ABSTRACT

Reward function design and exploration time are arguably the biggest obstacles to the deployment of reinforcement learning (RL) agents in the real world. In many real-world tasks, designing a suitable reward function takes considerable manual engineering and often requires additional and potentially visible sensors to be installed just to measure whether the task has been executed successfully. Furthermore, many interesting tasks consist of multiple steps that must be executed in sequence. Even when the final outcome can be measured, it does not necessarily provide useful feedback on these implicit intermediate steps or sub-goals. To address these issues, we propose leveraging the abstraction power of intermediate visual representations learned by deep models to quickly infer perceptual reward functions from small numbers of demonstrations. We present a method that is able to identify the key intermediate steps of a task from only a handful of demonstration sequences, and automatically identify the most discriminative features for identifying these steps. This method makes use of the features in a pre-trained deep model, but does not require any explicit sub-goal supervision. The resulting reward functions, which are dense and smooth, can then be used by an RL agent to learn to perform the task in real-world settings. To evaluate the learned reward functions, we present qualitative results on two real-world tasks and a quantitative evaluation against a human-designed reward function. We also demonstrate that our method can be used to learn a complex real-world door opening skill using a real robot, even when the demonstration used for reward learning is provided by a human using their own hand. To our knowledge, these are the first results showing that complex robotic manipulation skills can be learned directly and without supervised labels from a video of a human performing the task.

1 INTRODUCTION

Social learning, such as imitation, plays a critical role in allowing humans and animals to quickly acquire complex skills in the real world. Humans can use this weak form of supervision to acquire behaviors from very small numbers of demonstrations, in sharp contrast to deep reinforcement learning (RL) methods, which typically require extensive training data. In this work, we make use of two ideas to develop a scalable and efficient imitation learning method: first, imitation makes use of extensive prior knowledge to quickly glean the gist of a new task from even a small number of demonstrations; second, imitation involves both observation and trial-and-error learning (RL). Building on these ideas, we propose a reward learning method for understanding the intent of a user demonstration through the use of pre-trained visual features, which provide the prior knowledge for efficient imitation. Our algorithm aims to discover not only the high-level goal of a task, but also the implicit sub-goals and steps that comprise more complex behaviors. Extracting such sub-goals can allow the agent to make maximal use of the information contained in the demonstration.
Once the reward function has been extracted, the agent can use its own experience at the task to determine the physical structure of the behavior, even when the reward is provided by an agent with a substantially different embodiment (e.g. a human providing a demonstration for a robot).

Work done as part of the Google Brain Residency program (g.co/brainresidency).

Figure 1: Method overview. Given a few demonstration videos of the same action, our method discovers intermediate steps, then trains a classifier for each step on top of the mid and high-level representations of a pre-trained deep model (in this work, we use all activations starting from the first "mixed" layer that follows the first 5 convolutional layers). The step classifiers are then combined to produce a single reward function per step. These intermediate rewards are combined into a single reward function. The reward function is then used by a real robot to learn to perform the demonstrated task, as shown in Section 3.2.

To our knowledge, our method is the first reward learning technique that learns generalizable vision-based reward functions for complex robotic manipulation skills from only a few demonstrations provided directly by a human. Although prior methods have demonstrated reward learning with vision for real-world robotic tasks, they have either required kinesthetic demonstrations with robot state for reward learning (Finn et al., 2015), or else required low-dimensional state spaces and numerous demonstrations (Wulfmeier et al., 2016). The contributions of this paper are:

- A method for perceptual reward learning from only a few demonstrations of real-world tasks. Reward functions are dense and incremental, with automated unsupervised discovery of intermediate steps.
- The first vision-based reward learning method that can learn a complex robotic manipulation task from a few human demonstrations in real-world robotic experiments.
- A set of empirical experiments that show that the learned visual representations inside a pre-trained deep model are general enough to be directly used to represent goals and sub-goals for manipulation skills in new scenes without retraining.

1.1 RELATED WORK

Deep reinforcement learning and deep robotic learning work has previously examined learning reward functions based on images. One of the most common approaches to image-based reward functions is to directly specify a target image by showing the learner the raw pixels of a successful task completion state, and then using distance to that image (or its latent representation) as a reward function (Lange et al., 2012; Finn et al., 2015; Watter et al., 2015). However, this approach has several severe shortcomings. First, the use of a target image presupposes that the system can achieve a substantially similar visual state, which precludes generalization to semantically similar but visually distinct situations.
Second, the use of a target image does not provide the learner with information about which facets of the image are more or less important for task success, which might result in the learner excessively emphasizing irrelevant factors of variation (such as the color of a door due to light and shadow) at the expense of relevant factors (such as whether or not the door is open or closed).

Analyzing a collection of demonstrations to learn a parsimonious reward function that explains the demonstrated behavior is known as inverse reinforcement learning (IRL) (Ng et al., 2000). A few recently proposed IRL algorithms have sought to combine IRL with vision and deep network representations (Finn et al., 2016b; Wulfmeier et al., 2016). However, scaling IRL to high-dimensional systems and open-ended reward representations is very challenging. The previous work closest to ours used images together with robot state information (joint angles and end effector pose), with tens of demonstrations provided through kinesthetic teaching (Finn et al., 2016b). The approach we propose in this work, which can be interpreted as a simple and efficient approximation to IRL, can use demonstrations that consist of videos of a human performing the task using their own body, and can acquire reward functions with intermediate sub-goals using just a few examples. This kind of efficient vision-based reward learning from videos of humans has not been demonstrated in prior IRL work. The idea of perceptual reward functions using raw pixels was also explored by Edwards et al. (2016), which, while sharing the same spirit as this work, was limited to simple synthetic tasks and used single images as perceptual goals rather than multiple demonstration videos.

2 SIMPLE INVERSE REINFORCEMENT LEARNING WITH VISUAL FEATURES

The key insight in our approach is that we can exploit the semantically meaningful and powerful features in a pre-trained deep neural network to infer task goals and sub-goals using a very simple approximate inverse reinforcement learning method. The pre-trained network effectively transfers prior knowledge about the visual world to make imitation learning fast and robust. Our approach can be interpreted as a simple approximation to inverse reinforcement learning under a particular choice of system dynamics, as discussed in Section 2.1. While this approximation is somewhat simplistic, it affords an efficient and scalable learning rule that avoids overfitting even when trained on a small number of demonstrations. As depicted in Fig. 1, our algorithm first segments the demonstrations into segments based on perceptual similarity, as described in Section 2.2. Intuitively, the resulting segments correspond to sub-goals or steps of the task. The segments can then be used as a supervision signal for learning step classifiers, described in Section 2.3, which produces a single perceptual reward function for each step of the task. The combined reward function can then be used with a reinforcement learning algorithm to learn the demonstrated behavior. Although this method for extracting reward functions is exceedingly simple, its power comes from the use of highly general and robust pre-trained visual features, and our key empirical result is that such features are sufficient to acquire effective and generalizable reward functions for real-world manipulation skills.
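To give a concrete picture of the kind of pre-trained feature extraction we rely on (the exact layer choice is described in the next paragraph), the following is a minimal Python sketch. It uses the publicly available Keras InceptionV3 weights as a stand-in for the pre-trained Inception model; the specific layer names ("mixed0", "mixed5", "mixed10") and the preprocessing call are assumptions of this sketch, not details specified by the paper.

```python
# Sketch: extract mid/high-level activations from a pre-trained Inception network
# to serve as the state representation s_t for each video frame.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
feature_layers = ["mixed0", "mixed5", "mixed10"]  # assumed subset for brevity
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=[base.get_layer(name).output for name in feature_layers])

def video_features(frames):
    """frames: uint8 array of shape (T, 299, 299, 3) -> (T, N) activation matrix."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        frames.astype(np.float32))
    outputs = extractor(x, training=False)  # list of (T, H, W, C) activation maps
    return np.concatenate(
        [np.reshape(o.numpy(), (frames.shape[0], -1)) for o in outputs], axis=1)
```

The deep model is used purely as a frozen feature extractor here; no fine-tuning is performed at any point in the method.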
We use the Inception network (Szegedy et al., 2015) pre-trained for ImageNet classification (Deng et al., 2009) to obtain the visual features for representing the learned rewards. It is well known that visual features in such networks are quite general and can be reused for other visual tasks. However, it is less clear if sparse subsets of such features can be used directly to represent goals and sub-goals for real-world manipulation skills. Our experimental evaluation suggests that indeed they can, and that the resulting reward representations are robust and reliable enough for real-world robotic learning without any fine-tuning of the features. In this work, we use all activations starting from the first "mixed" layer that follows the first 5 convolutional layers (this layer's activation map is of size 35x35x256 given a 299x299 input). While this paper focuses on visual perception, the approach is general and can be applied to other modalities (e.g. audio and tactile).

2.1 INVERSE REINFORCEMENT LEARNING WITH TIME-INDEPENDENT GAUSSIAN MODELS

Inverse reinforcement learning can be performed with a variety of algorithms (Ng et al., 2000), ranging from margin-based methods (Abbeel & Ng, 2004; Ratliff et al., 2006) to methods based on probabilistic models (Ramachandran & Amir, 2007; Ziebart et al., 2008). In this work, we use a very simple approximation to the MaxEnt IRL model (Ziebart et al., 2008), a popular probabilistic approach to IRL. We will use s_t to denote the visual feature activations at time t, which constitute the state, s_{i,t} to denote the i-th feature at time t, and τ = {s_1, . . . , s_T} to denote a sequence or trajectory of these activations in a video. In MaxEnt IRL, the demonstrated trajectories τ are assumed to be drawn from a Boltzmann distribution according to:

p(τ) = p(s_1, . . . , s_T) = (1/Z) exp( Σ_{t=1}^{T} R(s_t) ),   (1)

where R(s_t) is the unknown reward function. The principal computational challenge in MaxEnt IRL is to approximate Z, since the states at each time step are not independent, but are constrained by the system dynamics. In deterministic systems, where s_{t+1} = f(s_t, a_t) for actions a_t and dynamics f, the dynamics impose constraints on which trajectories τ are feasible. In stochastic systems, where s_{t+1} ~ p(s_{t+1} | s_t, a_t), we must also account for the dynamics distribution in Equation (1), as discussed by Ziebart et al. (2008). Prior work has addressed this using dynamic programming to exactly compute Z in small, discrete systems (Ziebart et al., 2008), or by using sampling to estimate Z for large, continuous domains (Kalakrishnan et al., 2010; Boularias et al., 2011; Finn et al., 2016b). Since our state representation corresponds to a large vector of visual features, exact dynamic programming is infeasible. Sample-based approximation requires running a large number of trials to estimate Z and, as shown in recent work (Finn et al., 2016a), corresponds to a variant of generative adversarial networks (GANs), with all of the accompanying stability and optimization challenges. Furthermore, the corresponding model for the reward function is complex, making it prone to overfitting when only a small number of demonstrations is available. When faced with a difficult learning problem in extremely low-data regimes, a standard solution is to resort to simple, biased models, so as to minimize overfitting.
We adopt precisely this approach in our work: instead of approximating the complex posterior distribution over trajectories under nonlinear dynamics, we use a simple biased model that affords efficient learning and minimizes overfitting. Specifically, we assume that all trajectories are dynamically feasible, and that the distribution over each activation at each time step is independent of all other activations and all other time steps. This corresponds to the IRL equivalent of a naïve Bayes model: in the same way that naïve Bayes uses an independence assumption to mitigate overfitting in high-dimensional feature spaces, we use independence between both time steps and features to learn from very small numbers of demonstrations. Under this assumption, the probability of a trajectory τ factorizes according to

p(τ) = Π_{t=1}^{T} Π_{i=1}^{N} p(s_{i,t}) = Π_{t=1}^{T} Π_{i=1}^{N} (1/Z_{i,t}) exp( R_i(s_{i,t}) ),

which corresponds to a reward function of the form R_t(s_t) = Σ_{i=1}^{N} R_i(s_{i,t}). We can then simply choose a form for R_i(s_{i,t}) that can be normalized analytically, which in our case is a quadratic in s_{i,t}, such that exp(R_i(s_{i,t}))/Z_{i,t} is a Gaussian distribution, and the original trajectory distribution is a naïve Bayes model. While this approximation is quite drastic, it yields an exceedingly simple learning rule: in the most basic version, we have only to fit the mean and variance of each feature distribution, and then use the log of the resulting Gaussian as the reward.

2.2 DISCOVERY OF INTERMEDIATE STEPS

The simple IRL model in the previous section can be used to acquire a single quadratic reward function in terms of the visual features s_t. However, for complex multi-stage tasks, this model can be too coarse, making task learning slow and difficult. We therefore instead fit multiple quadratic reward functions, with one reward function per intermediate step or goal. These steps are discovered automatically in the first stage of our method, which is performed independently on each demonstration. If multiple demonstrations are available, they are pooled together in the feature selection step discussed in the next section, and could in principle be combined at the segmentation stage as well, though we found this to be unnecessary in our prototype. The intermediate steps model extends the simple independent Gaussian model in the previous section by assuming that

p(s_{i,t}) = (1/Z_{i,t}) exp( R_{i,g_t}(s_{i,t}) ),

where g_t is the index of the goal or step corresponding to time step t. Learning then corresponds to identifying the boundaries of the steps in the demonstration, and fitting independent Gaussian feature distributions at each step. Note that this corresponds exactly to segmenting the demonstration such that the variance of each feature within each segment is minimized.

In this work, we employ a simple recursive video segmentation algorithm, described in Algorithm 1. Intuitively, this method breaks down a sequence so that each frame in a segment is abstractly similar to every other frame in that segment. The number of segments is provided manually in this approach, though it would be straightforward to also utilize standard model selection criteria for choosing this number automatically. There exists a body of unsupervised video segmentation methods (Yuan et al., 2007) which would likely enable a less constrained set of demonstrations to be used. While this is an important avenue of future work, we show that our simple approach is sufficient to demonstrate the efficacy of our method on a realistic set of demonstrations.
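To make the per-step Gaussian model of Sections 2.1 and 2.2 concrete, the following minimal Python sketch fits a mean and variance for every feature inside each discovered segment and scores a new frame by the Gaussian log-likelihood. It assumes activations have already been extracted as a (T, N) matrix and that segment boundaries come from the segmentation step; the names and the small variance floor are illustrative choices, not details from the paper.

```python
import numpy as np

def fit_step_models(features, boundaries):
    """features: (T, N) activations for one demo; boundaries: segment start indices
    (e.g. produced by Algorithm 1). Returns a (mean, variance) pair per step."""
    edges = list(boundaries) + [features.shape[0]]
    models = []
    for start, end in zip([0] + list(boundaries), edges):
        seg = features[start:end]
        models.append((seg.mean(axis=0), seg.var(axis=0) + 1e-6))  # variance floor
    return models

def gaussian_log_reward(feat_t, mean, var):
    """Log-probability of an independent Gaussian over features, used as the
    reward R_g(s_t) for sub-goal g (constant terms retained for clarity)."""
    return -0.5 * np.sum((feat_t - mean) ** 2 / var + np.log(2 * np.pi * var))
```

Fitting reduces to per-feature moment estimation within each segment, which is what keeps the method cheap and resistant to overfitting in the few-demonstration regime.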
We also investigate how to reduce the search space of similar feature patterns across videos in Section 2.3. This would render discovery of video alignments tractable for an optimization method such as the one used in Joulin et al. (2014) for video co-localization. The complexity of Algorithm 1 is O(n^m), where n is the number of frames in a sequence and m the number of splits. Note that dynamic programming is not applicable to this algorithm because each sub-problem, i.e. how to split a sequence after the i-th frame, depends on the segmentation chosen before the i-th frame. We also experiment with a greedy binary version of this algorithm (Algorithm 2, detailed in Section A.1): first split the entire sequence in two, then recursively split each new segment in two. While not exactly minimizing the variance across all segments, it is significantly more efficient (O(n^2 log m)) and yields qualitatively sensible results.

Algorithm 1 Recursive similarity maximization, where AverageStd() computes the average standard deviation over a set of frames or over a set of values, Join() joins values or lists together into a single list, n is the number of splits desired, and min_size is the minimum size of a split.

  function SPLIT(video, start, end, n, min_size, prev_std = [])
    if n = 1 then
      return [], [AVERAGESTD(video[start : end])]
    end if
    min_std ← None
    min_std_list ← []
    min_split ← []
    for i ← start + min_size to end − (n − 1) × min_size do
      std1 ← [AVERAGESTD(video[start : i])]
      splits2, std2 ← SPLIT(video, i, end, n − 1, min_size, prev_std + std1)
      avg_std ← AVERAGESTD(JOIN(prev_std, std1, std2))
      if min_std = None or avg_std < min_std then
        min_std ← avg_std
        min_std_list ← JOIN(std1, std2)
        min_split ← JOIN(i, splits2)
      end if
    end for
    return min_split, min_std_list
  end function

2.3 STEPS CLASSIFICATION

In this section we explore learning a step classifier on top of the pre-trained deep model, using both a regular linear classifier and a custom feature selection classifier. Intent understanding requires identifying highly discriminative features of a specific goal while remaining invariant to unrelated variation (e.g. lighting, color, viewpoint). The relevant discriminative features may be very diverse and more or less abstract, which motivates our intuition to tap into the activations of deep models at different depths. Deep models cover a large set of representations that can be useful, from spatially dense and simple features in the lower layers (e.g. a large collection of detected edges) to gradually more spatially sparse and abstract features (e.g. a few object classes). We train a simple linear layer which takes as input the same mid to high level activations used for steps discovery and outputs a score for each step. This linear layer is trained using logistic regression and the underlying deep model is not fine-tuned. Given the large input (1,453,824 units) and the low data regime (11 to 19 videos of 30 to 50 frames each), we hypothesize that this model should severely overfit to the training data and perform poorly in validation and testing. We test and discuss that hypothesis in Section 3.1.2.

We also hypothesize that there exists a small subset of mid to high-level features that are sparse and independent and can readily and compactly discriminate previously unseen inputs. We investigate that hypothesis using a simple feature selection method described in Appendix A.3.
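To make the linear step classifier described above concrete, here is a minimal sketch that trains a logistic regression on the frozen activations, using frame-level step labels obtained from the unsupervised segmentation. The use of scikit-learn and its default solver settings are assumptions of this sketch; the paper only specifies a linear layer trained with logistic regression on top of the frozen deep model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_step_classifier(demo_features, demo_step_labels):
    """demo_features: list of (T_i, N) activation matrices, one per demo video.
    demo_step_labels: list of (T_i,) integer step indices from steps discovery."""
    X = np.concatenate(demo_features, axis=0)
    y = np.concatenate(demo_step_labels, axis=0)
    clf = LogisticRegression(max_iter=1000)  # the deep model itself stays frozen
    clf.fit(X, y)
    return clf

def step_scores(clf, frame_features):
    """Per-step probabilities for a single frame, usable as intermediate rewards."""
    return clf.predict_proba(frame_features[None, :])[0]
```

With roughly a million input features and only a few hundred training frames, this is exactly the regime where one would expect severe overfitting; Section 3.1.2 examines whether that expectation holds.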
The existence of a small subset of discriminative features can be useful for reducing overfitting in low data regimes, but more importantly can allow a drastic reduction of the search space for the unsupervised steps discovery. Indeed, since each frame is described by millions of features, finding patterns of feature correlations across videos leads to a combinatorial explosion. However, the problem may become tractable if there exists a low-dimensional subset of features that leads to reasonably accurate steps classification. We test and discuss that hypothesis in Section 3.1.2.

2.4 USING PERCEPTUAL REWARDS FOR ROBOTIC LEARNING

In order to use our learned perceptual reward functions in a complete skill learning system, we must also choose a reinforcement learning algorithm and a policy representation. While in principle any reinforcement learning algorithm could be suitable for this task, we chose a method that is efficient for evaluation on real-world robotic systems in order to validate our approach. The method we use is based on the PI2 reinforcement learning algorithm (Theodorou et al., 2010). Our implementation, which is discussed in more detail in Appendix A.4, uses a relatively simple linear-Gaussian parameterization of the policy, which corresponds to a sequence of open-loop torque commands with fixed linear feedback to correct for perturbations. This method also requires initialization from example demonstrations to learn complex manipulation tasks efficiently. A more complex neural network policy could also be used (Chebotar et al., 2016), and more sophisticated RL algorithms could also learn skills without demonstration initialization. However, since the main purpose of this component is to validate the learned reward functions, we used this simple approach to test our rewards quickly and efficiently.

3 EXPERIMENTS

In this section, we discuss our empirical evaluation, starting with an analysis of the learned reward functions in terms of both qualitative reward structure and quantitative segmentation accuracy. We then present results for a real-world validation of our method on robotic door opening.

3.1 PERCEPTUAL REWARDS EVALUATION

We report results on two demonstrated tasks: door opening and liquid pouring. We collected about a dozen training videos for each task using a smartphone. As an example, Fig. 2 shows the entire training set used for the pouring task.

Figure 2: Entire training set for the pouring task (11 demonstrations).

3.1.1 QUALITATIVE ANALYSIS

While a door opening sensor can be engineered using sensors hidden in the door, measuring pouring or container tilting would be quite complicated, would visually alter the scene, and is unrealistic for learning in the wild. Visual reward functions are therefore an excellent choice for complex physical phenomena such as liquid pouring. In Fig. 3, we present the combined reward functions for test videos on the pouring task, and Fig. 10 shows the intermediate rewards for each sub-goal. We plot the predicted reward functions for both successful and failed task executions in Fig. 11. We observe that for missed executions where the task is only partially performed, the intermediate steps are correctly classified. Fig. 9 details qualitative results of unsupervised step segmentation for the door opening and pouring tasks. For the door task, the 2-segment splits are often quite in line with what one can expect, while a 3-segment split is less accurate.
We also observe that the method is robust to the presence or absence of the handle on the door, as well as its opening direction. We find that for the pouring task, the 4-segment split often yields the most sensible breakdown. It is interesting to note that the 2-segment split usually occurs when the glass is about half full.

Failure Cases: The intermediate reward function for the door opening task which corresponds to a human hand manipulating the door handle seems rather noisy or wrong in 10b, 10c and 10e ("action1" on the y-axis of the plots). The reward function in 11f remains flat while liquid is being poured into the glass. The liquid being somewhat transparent, we suspect that it looks too similar to the transparent glass for the function to fire.

Figure 3: Examples of pouring reward functions. We show here a few successful examples; see Fig. 11 for results on the entire test set. In 3a we observe a continuous and incremental reward as the task progresses, saturating as it is completed. 3b increases as the bottle appears but successfully detects that the task is not completed, while in 3c it successfully detects that the action is already completed from the start.

3.1.2 QUANTITATIVE ANALYSIS

We evaluate the quantitative accuracy of the unsupervised steps discovery in Table 1, while Table 2 presents quantitative generalization results for the learned reward on a test video of each task. For each video, ground truth intermediate steps were provided by human supervision for the purpose of evaluation. While this ground truth is subjective, since each task can be broken down in multiple ways, it is consistent for the simple tasks in our experiments. We use the Jaccard similarity measure (intersection over union) to indicate how much a detected step overlaps with its corresponding ground truth.

Table 1: Unsupervised steps discovery accuracy (Jaccard overlap on training sets) versus the ordered random steps baseline.

| dataset (training) | method | 2 steps: step 1 | step 2 | average | 3 steps: step 1 | step 2 | step 3 | average |
|---|---|---|---|---|---|---|---|---|
| door | ordered random steps | 59.4% | 45.6% | 52.5% | 48.0% | 58.1% | 60.1% | 55.4% |
| door | unsupervised steps | 84.0% | 68.1% | 76.1% | 57.6% | 75.1% | 68.1% | 66.9% |
| pouring | ordered random steps | 65.2% | 66.6% | 65.9% | 46.2% | 46.3% | 66.3% | 52.9% |
| pouring | unsupervised steps | 92.3% | 90.5% | 91.6% | 79.7% | 48.0% | 48.6% | 58.8% |

In Table 1, we compare our method against a random baseline. Because we assume the same step order in all demonstrations, we also order the random steps in time to provide a fair baseline. Note that the random baseline performs fairly well because the steps are distributed somewhat uniformly in time. Should the steps be much less temporally uniform, the random baseline would be expected to perform very poorly, while our method should maintain similar performance. We compare splitting between 2 and 3 steps and find that, for both tasks, 2 steps are easier to discover, probably because these tasks each exhibit one strong visual change while the other steps are more subtle. Note that our unsupervised segmentation only works when full sequences are available, while our learned reward functions can be used in real time without accessing future frames. Hence in these experiments we evaluate the unsupervised segmentation on the training set only and evaluate the reward functions on the test set.

Table 2: Reward functions accuracy by steps (Jaccard overlap on test sets).
| dataset (testing) | classification method | 2 steps average | 3 steps average |
|---|---|---|---|
| door | random baseline | 33.6% ± 1.6 | 25.5% ± 1.6 |
| door | feature selection | 72.4% ± 0.0 | 52.9% ± 0.0 |
| door | linear classifier | 75.0% ± 5.5 | 53.6% ± 4.7 |
| pouring | random baseline | 31.1% ± 3.4 | 25.1% ± 0.1 |
| pouring | feature selection | 65.4% ± 0.0 | 40.0% ± 0.0 |
| pouring | linear classifier | 69.2% ± 2.0 | 49.6% ± 8.0 |

In Table 2, we evaluate the reward functions individually for each step on the test set. For that purpose, we binarize the reward function using a threshold of 0.5. The random baseline simply outputs true or false at each timestep. We observe that the learned feature selection and linear classifier functions outperform the baseline by about a factor of 2. It is not clear exactly what minimum level of accuracy is required to successfully learn to perform these tasks, but we show in Section 3.2.2 that the reward accuracy on the door task is sufficient to reach a 100% success rate with a real robot. Individual step accuracy details can be found in Table 3. Surprisingly, the linear classifier performs well and does not appear to overfit on our relatively small training set. Although the feature selection algorithm comes rather close to the linear classifier compared to the baseline, using feature selection to avoid overfitting is not necessary here. However, the idea that a small subset of features (32 in this case) can lead to reasonable classification accuracy is verified, and this is an important piece of information for drastically reducing the search space for future work in unsupervised steps discovery. Additionally, we show in Fig. 4 that the feature selection approach works well when the number of features n is in the region [32, 64] but collapses to 0% accuracy when n > 8192.

Figure 4: Feature selection classification accuracy on the pouring validation set for 2-steps classification. By varying the number of features n selected, we show that the method yields good results in the region n = [32, 64], but collapses to 0% accuracy starting at n = 8192.

3.2 REAL-WORLD ROBOTIC DOOR OPENING

In this section, we aim to answer the question of whether our previously visualized reward function can be used to learn a real-world robotic motion skill. We experiment on a door opening skill, where we adapt a demonstrated door opening to a novel configuration, such as a different position or orientation of the door. Following the experimental protocol in prior work (Chebotar et al., 2016), we adapt an imperfect kinesthetic demonstration which we ensure succeeds at least occasionally (about 10% of the time). These demonstrations consist only of robot poses, and do not include images. We then use a variety of different video demonstrations, which contain images but not robot poses, to learn the reward function. These videos include demonstrations with other doors, and even demonstrations provided by a human using their own body, rather than through kinesthetic teaching with the robot.

Figure 5: Robot arm setup. Note that our method does not make use of the sensor on the back handle of the door, but it is used in our comparison to train a baseline method with the ground truth reward.

Figure 5 shows the experimental setup. We use a 7-DoF robotic arm with a two-finger gripper, and a camera placed above the shoulder, which provides monocular RGB images. For our baseline PI2 policy, we closely follow the setup of Chebotar et al. (2016), which uses an IMU sensor in the door handle to provide both a cost and feedback as part of the state of the controller.
In contrast, in our approach we remove this sensor from the state representation provided to PI2, and in our reward we replace the target IMU state with the output of a deep neural network.

Figure 6: Rewards from human demonstration only. Here we show the rewards produced when trained on humans only (see Fig. 12). In 6a, we show the reward on a human test video. In 6b, we show what the reward produces when the human hand misses opening the door. In 6c, we show the reward successfully saturates when the robot opens the door even though it has not seen a robot arm before. Similarly, in 6d and 6e we show it still works with some amount of variation of the door which was not seen during training (white door and black handle, blue door, rotations of the door).

We experiment with a range of different demonstrations from which we derive our reward function, varying the source demo (human vs. robotic), the number of sub-goals we extract, and the appearance of the door. We record monocular RGB images from a camera placed above the shoulder of the arm. The door is cropped from the images, and the resulting image is resized so that the shortest side is 299 pixels with preserved aspect ratio. The input into our convolutional feature extractor (Szegedy et al., 2015) is the 299x299 center crop.

3.2.2 QUALITATIVE ANALYSIS

We evaluate our reward functions qualitatively by plotting our perceptual reward functions below the demonstrations with a variety of door types and demonstrators (e.g. robot or human). As can be seen in Fig. 6 and in the real experiments of Fig. 7, the reward functions are useful to a robotic arm even when trained only on human demonstrations, as depicted in Fig. 12. Moreover, we exhibit robustness to variations in appearance.

Figure 7: Door opening success rate at each iteration of learning on the real robot. The PI2 baseline method uses a ground truth reward function obtained by instrumenting the door. Note that rewards learned by our method, even from videos of humans or different doors, learn comparably or faster when compared to the ground truth reward.

3.2.3 QUANTITATIVE ANALYSIS

We compare the success rate of our visual reward against a baseline PI2 method that uses the ground truth reward function obtained by instrumenting the door with an IMU. We run PI2 for 11 iterations with 10 sampled trajectories at each iteration. As can be seen in Fig. 7, we obtain similar convergence speeds to our baseline model, with our method also able to open the door consistently. Since our local policy is able to obtain high-reward candidate trajectories, this is strong evidence that a perceptual reward could be used to train a global policy in the same manner as Chebotar et al. (2016).

4 CONCLUSION

In this paper, we present a method for automatically identifying important intermediate goals given a few visual demonstrations of a task. By leveraging the general features learned from pre-trained deep models, we propose a method for rapidly learning an incremental reward function from human demonstrations, which we successfully demonstrate on a real robotic learning task. We show that pre-trained models are general enough to be used without retraining. We also show there exists a small subset of pre-trained features that are highly discriminative even for previously unseen scenes and which can be used to reduce the search space for future work in unsupervised steps discovery.
Another compelling direction for future work is to explore how reward learning algorithms can be combined with robotic lifelong learning. One of the biggest barriers to lifelong learning in the real world is the ability of an agent to obtain reward supervision, without which no learning is possible. Continuous learning using unsupervised rewards promises to substantially increase the variety and diversity of experience that is available for robotic reinforcement learning, resulting in more powerful, robust, and general robotic skills.

ACKNOWLEDGMENTS

We would like to thank Vincent Vanhoucke for helpful discussions and feedback. We would also like to thank Mrinal Kalakrishnan and Ali Yahya for indispensable guidance throughout this project.

REFERENCES

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. 2011.

Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. arXiv preprint arXiv:1610.00529, 2016.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Ashley Edwards, Charles Isbell, and Atsuo Takanishi. Perceptual reward functions. arXiv preprint arXiv:1608.03824, 2016.

Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. arXiv preprint arXiv:1509.06113, 2015.

Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. arXiv preprint arXiv:1603.00448, 2016b.

A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision (ECCV), 2014.

Mrinal Kalakrishnan, Evangelos Theodorou, and Stefan Schaal. Inverse reinforcement learning with PI2. 2010.

Sascha Lange, Martin Riedmiller, and Arne Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. IEEE, 2012.

Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663-670, 2000.

Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In AAAI Conference on Artificial Intelligence (AAAI 2010), 2010.

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1-4, 2007.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 729-736. ACM, 2006.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.

Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137-3181, 2010.
Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746-2754, 2015.

Markus Wulfmeier, Dominic Zeng Wang, and Ingmar Posner. Watch This: Scalable cost-function learning for path planning in urban environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016. arXiv preprint: http://arxiv.org/abs/1607.02329.

Jinhui Yuan, Huiyi Wang, Lan Xiao, Wujie Zheng, Jianmin Li, Fuzong Lin, and Bo Zhang. A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology, 17(2):168-186, 2007.

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433-1438, 2008.

A ALGORITHMS DETAILS

A.1 BINARY SEGMENTATION ALGORITHM

Algorithm 2 Greedy binary algorithm, similar to and utilizing Algorithm 1, where AverageStd() computes the average standard deviation over a set of frames or over a set of values, Join() joins values or lists together into a single list, n is the number of splits desired, and min_size is the minimum size of a split.

  function BINARYSPLIT(video, start, end, n, min_size, prev_std = [])
    if n = 1 then
      return [], []
    end if
    splits0, std0 ← SPLIT(video, start, end, 2, min_size)
    if n = 2 then
      return splits0, std0
    end if
    splits1, std1 ← BINARYSPLIT(video, start, splits0[0], CEIL(n/2), min_size)
    splits2, std2 ← BINARYSPLIT(video, splits0[0] + 1, end, FLOOR(n/2), min_size)
    all_splits ← []
    all_std ← []
    if splits1 ≠ [] then
      JOIN(all_splits, splits1)
      JOIN(all_std, std1)
    else
      JOIN(all_std, std0[0])
    end if
    if splits0 ≠ [] then
      JOIN(all_splits, splits0[0])
    end if
    if splits2 ≠ [] then
      JOIN(all_splits, splits2)
      JOIN(all_std, std2)
    else
      JOIN(all_std, std0[1])
    end if
    return all_splits, all_std
  end function

A.2 COMBINING INTERMEDIATE REWARDS

From the two previous sections, we obtain one reward function per intermediate step discovered by the unsupervised algorithm. These need to be combined so that the RL algorithm uses a single reward function which partially rewards intermediate steps but rewards the final step the most. The initial step is ignored, as it is assumed to be the resting starting state in the demonstrations. We opt for the maximum range of each reward to be twice the maximum range of its preceding reward, and sum them as follows:

R(a) = Σ_{i=2}^{n} R_i(a) 2^{(i-1)},   (2)

where n is the number of intermediate rewards detected and a an activations vector. An example of this combination is shown in Fig. 8.

Figure 8: Combining intermediate rewards into a single reward function. From top to bottom, we show the combined reward function (with range [0, 2]) followed by the reward function of each individual step (with range [0, 1]). The first step reward corresponds to the initial resting state of the demonstration and is ignored in the reward function. The second step corresponds to the pouring action and the third step corresponds to the glass full of liquid.

A.3 FEATURE SELECTION ALGORITHM

Here we describe the feature selection algorithm we use to investigate the presence of a small subset of discriminative features in mid to high level layers of a pre-trained deep network. To select the most discriminative features, we use a simple scoring heuristic. Each feature i is first normalized by subtracting the mean and dividing by the standard deviation of all training sequences. We then rank the features for each sub-goal according to their distance z_i to the average statistics of the sets of positive and negative frames for a given goal:

z_i = α |µ_i^+ − µ_i^−| − (σ_i^+ + σ_i^−),   (3)
where µ_i^+ and σ_i^+ are the mean and standard deviation of feature i over all positive frames, and µ_i^− and σ_i^− those over all negative frames (the frames that do not contain the sub-goal). Only the top-M features are retained to form the reward function R_g() for the sub-goal g, which is given by the log-probability of an independent Gaussian distribution over the relevant features:

R_g(s_t) = − Σ_{j=1}^{M} (s_{i_j,t} − µ_{i_j}^+)² / (σ_{i_j}^+)²,   (4)

where i_j indexes the top-M selected features. We empirically choose α = 5.0 and M = 32 for our subsequent experiments. At test time, we do not know when the system transitions from one goal to another, so instead of time-indexing the goals, we combine all of the goals into a single time-invariant reward function, where later steps yield higher reward than earlier steps, as described in Appendix A.2.

Table 3: Reward functions accuracy by steps (Jaccard overlap on test sets).

| dataset (testing) | method | step 1 | step 2 | step 3 | average |
|---|---|---|---|---|---|
| door, 2-steps | random baseline | 40.8% ± 1.0 | 26.3% ± 4.1 | - | 33.6% ± 1.6 |
| door, 2-steps | feature selection | 85.1% ± 0.0 | 59.7% ± 0.0 | - | 72.4% ± 0.0 |
| door, 2-steps | linear classifier | 79.7% ± 6.0 | 70.4% ± 5.0 | - | 75.0% ± 5.5 |
| door, 3-steps | random baseline | 20.8% ± 1.1 | 31.8% ± 1.6 | 23.8% ± 2.3 | 25.5% ± 1.6 |
| door, 3-steps | feature selection | 56.9% ± 0.0 | 47.7% ± 0.0 | 54.1% ± 0.0 | 52.9% ± 0.0 |
| door, 3-steps | linear classifier | 46.0% ± 6.9 | 47.5% ± 4.2 | 67.2% ± 3.3 | 53.6% ± 4.7 |
| pouring, 2-steps | random baseline | 39.2% ± 2.9 | 22.9% ± 3.9 | - | 31.1% ± 3.4 |
| pouring, 2-steps | feature selection | 76.2% ± 0.0 | 54.6% ± 0.0 | - | 65.4% ± 0.0 |
| pouring, 2-steps | linear classifier | 78.2% ± 2.4 | 60.2% ± 1.7 | - | 69.2% ± 2.0 |
| pouring, 3-steps | random baseline | 22.5% ± 0.6 | 38.8% ± 0.8 | 13.9% ± 0.1 | 25.1% ± 0.1 |
| pouring, 3-steps | feature selection | 32.9% ± 0.0 | 55.2% ± 0.0 | 32.2% ± 0.0 | 40.0% ± 0.0 |
| pouring, 3-steps | linear classifier | 72.5% ± 10.5 | 37.2% ± 11.0 | 39.1% ± 6.8 | 49.6% ± 8.0 |

A.4 PI2 REINFORCEMENT LEARNING ALGORITHM

We chose the PI2 reinforcement learning algorithm (Theodorou et al., 2010) for our experiments, with the particular implementation of the method based on a recently proposed deep reinforcement learning variant (Chebotar et al., 2016). Since our aim is mainly to validate that our learned reward functions capture the goals of the task well enough for learning, we employ a relatively simple linear-Gaussian parameterization of the policy, which corresponds to a sequence of open-loop torque commands with fixed linear feedback to correct for perturbations, as in the work of Chebotar et al. (2016). This policy has the form π(u_t|x_t) = N(K_t x_t + k_t, Σ_t), where K_t is a fixed stabilizing feedback matrix and k_t is a learned control. In this case, the state x_t corresponds to the joint angles and angular velocities of the robot, and u_t corresponds to the joint torques. Since the reward function is evaluated from camera images, we assume that the image is a (potentially stochastic) consequence of the robot's state, so that we can evaluate the state reward r(x_t) by taking the image I_t observed at time t and computing the corresponding activations a_t. Overloading the notation, we can write a_t = f(I_t(x_t)), where f is the network we use for visual features. Then, we have r(x_t) = R(f(I_t(x_t))). The PI2 algorithm is an episodic policy improvement algorithm that uses the reward r(x_t) to iteratively improve the policy.
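As a rough illustration of how the learned reward enters this loop, the sketch below combines per-step rewards with the doubling weights of Appendix A.2 and evaluates r(x_t) from a camera image. Here `feature_fn` and `step_reward_fns` are hypothetical placeholders for the pre-trained feature extractor and the learned per-step reward functions, and any overall normalization of the combined reward is omitted.

```python
def combined_reward(step_rewards):
    """Combine per-step rewards (step_rewards[0] is the ignored resting step):
    each step's maximum range is twice that of the preceding step, as in
    Appendix A.2. Overall normalization is left out of this sketch."""
    return sum(r * (2 ** j) for j, r in enumerate(step_rewards) if j >= 1)

def reward_from_image(image, feature_fn, step_reward_fns):
    """r(x_t) = R(f(I_t)): extract activations from the camera image, score each
    sub-goal with its learned reward function, then combine the scores."""
    activations = feature_fn(image)
    return combined_reward([reward(activations) for reward in step_reward_fns])
```

The RL algorithm therefore never sees the image directly; it only receives the scalar reward computed from the frozen visual features at each time step.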
The trust-region variant of PI2 that we use (Chebotar et al., 2016), which is also similar to the REPS algorithm (Peters et al., 2010), updates the policy at iteration n by sampling from the time-varying linear-Gaussian policy π(u_t|x_t) to obtain samples {(x_t^(i), u_t^(i))}, and updating the controls k_t at each time step according to

k_t ← Σ_i u_t^(i) exp( β_t Σ_{t'=t}^{T} r(x_{t'}^(i)) ) / Σ_i exp( β_t Σ_{t'=t}^{T} r(x_{t'}^(i)) ),

where the temperature β_t is chosen to bound the KL-divergence between the new policy π(u_t|x_t) and the previous policy π̄(u_t|x_t), such that D_KL(π(u_t|x_t) || π̄(u_t|x_t)) ≤ ε for a step size ε. Further details and a complete derivation are provided in prior work (Theodorou et al., 2010; Peters et al., 2010; Chebotar et al., 2016).

The PI2 algorithm is a local policy search method that performs best when provided with demonstrations to bootstrap the policy. In our experiments, we use this method together with our learned reward functions to learn a door opening skill with a real physical robot, as discussed in Section 3.2. Demonstrations are provided with kinesthetic teaching, which results in a sequence of reference states x̂_t, and the initial controls k_t are given by k_t = −K_t x̂_t, such that the mean of the initial controller is K_t(x_t − x̂_t), corresponding to a trajectory-following initialization. This initial controller is rarely successful consistently, but the occasional successes it achieves provide a learning signal to the algorithm. The use of demonstrations enables PI2 to quickly and efficiently learn complex robotic manipulation skills. Although this particular RL algorithm requires demonstrations to begin learning, it can still provide a useful starting point for real-world learning with a real robotic system. As shown by Chebotar et al. (2016), the initial set of demonstrations can be expanded into a generalizable policy by iteratively growing the effective region where the policy succeeds. For example, if the robot is provided with a demonstration of opening a door in one position, additional learning can expand the policy to succeed in nearby positions, and the application of a suitable curriculum can progressively grow the region of door poses in which the policy succeeds. However, as with all RL algorithms, this process requires knowledge of the reward function. Using the method described in this paper, we can learn such a reward function from either the initial demonstrations or even from other demonstration videos provided by a human. Armed with this learned reward function, the robot could continue to improve its policy through real-world experience, iteratively increasing its region of competence through lifelong learning.

B ADDITIONAL QUALITATIVE RESULTS

Figure 9: Qualitative examples of unsupervised discovery of steps for door and pouring tasks in training videos. For each video, we show the detected splits when splitting into 2, 3 or 4 segments. Each segment is delimited by a different value on the vertical axis of the curves.

Figure 10: Qualitative examples of reward functions for the door task in testing videos. These plots show the individual sub-goal rewards for either 2- or 3-goal splits. The open or closed door reward functions fire quite reliably in all plots; the hand-on-handle step, however, can be a weaker and noisier signal, as seen in 10b and 10c, or incorrect, as shown in 10e. 10d demonstrates how a missed action is correctly recognized.
Figure 11: Entire testing set of pouring reward functions. This testing set is designed to be more challenging than the training set by including ambiguous cases such as pouring into an already full glass (11i and 11j) or pouring with a closed bottle (11g and 11h). Despite the ambiguous inputs, the reward functions do produce reasonably low or high reward based on how full the glass is. 11a, 11b, 11c and 11d are not strictly monotonically increasing but do overall demonstrate a reasonable progression as the pouring is executed, up to a saturated maximum reward when the glass is full. 11e also correctly trends upwards but starts with a high reward with an empty glass. 11f is a failure case where the somewhat transparent liquid is not detected.

Figure 12: Entire training set of human demonstrations.